Okay, so welcome to lecture two of CS231N. On Tuesday, just to recall, we gave you the big picture view of what computer vision is, what its history is, and a little bit of an overview of the class. And today we're really going to dive in, for the first time, into the details. We'll start to see, in much more depth, exactly how some of these learning algorithms actually work in practice. So, the first lecture of the class is probably the largest big-picture vision, and the majority of the lectures in this class will be much more detail oriented, much more focused on the specific mechanics of these different algorithms. So, today we'll see our first learning algorithm, and that'll be really exciting, I think. But, before we get to that, I wanted to talk about a couple of administrative issues. One is Piazza. When I checked yesterday, it seemed like we had maybe 500 students signed up on Piazza, which means that there are several hundred of you who are not yet there.
So, we really want Piazza to be the main source of communication between the students and the course staff. We've gotten a lot of questions to the staff list about project ideas, or questions about midterm attendance or poster session attendance, and any questions like that should really go to Piazza. You'll probably get answers to your questions faster on Piazza, because all the TAs know to check it, and it's easy for emails to get lost in the shuffle if you just send to the course list. It's also come to my attention that some SCPD students are having a bit of a hard time signing up for Piazza. SCPD students are supposed to receive an @stanford.edu email address. Once you get that email address, you can use the Stanford email to sign into Piazza. That probably doesn't affect those of you who are sitting in the room right now, but it does affect those students listening on SCPD. The next administrative issue is about assignment one. Assignment one will be up later today, probably sometime this afternoon, but I promise, before I go to sleep tonight, it'll be up.
But, if you're getting a little bit antsy and really want to start working on it right now, you can look at last year's version of assignment one. It'll be pretty much the same content. We're just reshuffling it a little bit, for example, upgrading it to work with Python 3 rather than Python 2.7, along with some other minor cosmetic changes, but the content of the assignment will still be the same as last year. So, in this assignment you'll be implementing your own k-nearest neighbor classifier, which we're going to talk about in this lecture. You'll also implement several different linear classifiers, including the SVM and Softmax, as well as a simple two-layer neural network. And we'll cover all of this content over the next couple of lectures. All of our assignments use Python and NumPy. If you aren't familiar with Python or NumPy, we have written a tutorial, which you can find on the course website, to try to get you up to speed. But this is actually pretty important.
NumPy lets you write very efficient vectorized operations that let you do quite a lot of computation in just a couple lines of code. Efficiently implementing these vectorized operations is super important for pretty much all aspects of numerical computing and machine learning, and you'll get a lot of practice with this on the first assignment. So, for those of you who don't have a lot of experience with MATLAB or NumPy or other types of vectorized tensor computation, I recommend that you start looking at this assignment pretty early and also read carefully through the tutorial. The other thing I wanted to talk about is that we're happy to announce that we're officially supported through Google Cloud for this class. Google Cloud is somewhat similar to Amazon AWS: you can go and start virtual machines up in the cloud, and these virtual machines can have GPUs. We're working on a tutorial for exactly how to use Google Cloud and get it to work for the assignments.
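As a tiny illustration of the vectorization point above, here is a hedged sketch comparing an explicit Python loop to the equivalent one-line NumPy expression. The arrays and the squared-difference operation are made up for illustration; the idea is simply that NumPy pushes the loop into fast compiled code:

```python
import numpy as np

# Two small example arrays (values chosen arbitrarily for illustration).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

# Loop version: explicit iteration over elements, one at a time.
loop_result = np.zeros_like(a)
for i in range(len(a)):
    loop_result[i] = (a[i] - b[i]) ** 2

# Vectorized version: one line, same result, far faster on large arrays.
vec_result = (a - b) ** 2

print(vec_result)  # -> [9. 1. 1. 9.]
```

On arrays with millions of elements, the vectorized form is typically orders of magnitude faster, which is why the assignments lean on it so heavily.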
But our intention is that you'll be able to just download some image, and it'll be very seamless for you to work on the assignments on one of these instances in the cloud. And because Google has very generously supported this course, we'll be able to distribute to each of you coupons that let you use Google Cloud credits for free for the class. So you can feel free to use these for the assignments, and also for the course projects when you want to start using GPUs and larger machines and whatnot. We'll post more details about that, probably on Piazza, later today. But I just wanted to mention it, because I know there had been a couple of questions about: can I use my laptop? Do I have to run on corn? And the answer is that you'll be able to run on Google Cloud, and we'll provide you some coupons for that. Yeah, so those are the major administrative issues I wanted to talk about today. And then, let's dive into the content.
So, in the last lecture we talked a little bit about the task of image classification, which is really a core task in computer vision, and it's something that we'll really focus on throughout the course of the class: exactly how do we work on this image classification task? A little bit more concretely, when you're doing image classification, your system receives some input image, which is this cute cat in this example, and the system is aware of some predetermined set of categories or labels. These might be a dog or a cat or a truck or a plane; there's some fixed set of category labels, and the job of the computer is to look at the picture and assign it one of these fixed category labels. This seems like a really easy problem, because so much of your own visual system in your brain is hardwired to doing these sorts of visual recognition tasks, but it is actually a really, really hard problem for a machine.
So, if you dig in and think about what a computer actually sees when it looks at this image, it definitely doesn't get the holistic idea of a cat that you see when you look at it. The computer really is representing the image as a gigantic grid of numbers. The image might be something like 800 by 600 pixels, and each pixel is represented by three numbers, giving the red, green, and blue values for that pixel. So, to the computer, this is just a gigantic grid of numbers, and it's very difficult to distill the cat-ness out of this giant array of many thousands of numbers. We refer to this problem as the semantic gap. This idea of a cat, or this label of a cat, is a semantic label that we're assigning to the image, and there's a huge gap between the semantic idea of a cat and the pixel values that the computer is actually seeing. And this is a really hard problem, because you can change the picture in very small, subtle ways that will cause this pixel grid to change entirely.
So, for example, if we took this same cat, and if the cat happened to sit still and not even twitch, not move a muscle, which is never going to happen, but we moved the camera to the other side, then every single pixel in this giant grid of numbers would be completely different. But, somehow, it's still representing the same cat, and our algorithms need to be robust to this. Viewpoint is one problem, but another is illumination: there can be different lighting conditions going on in the scene. Whether the cat is appearing in a very dark, moody scene, or in a very bright, sunlit scene, it's still a cat, and our algorithms need to be robust to that. Objects can also deform. I think cats are maybe among the more deformable of animals that you might see out there, and cats can really assume a lot of different, varied poses and positions. Our algorithms should be robust to these different kinds of transforms.
There can also be problems of occlusion, where you might only see part of a cat, like just the face, or, in this extreme example, just a tail peeking out from under the couch cushion. In these cases, it's pretty easy for you, as a person, to realize that this is probably a cat, and you still recognize these images as cats. This is something that our algorithms also must be robust to, which is quite difficult, I think. There can also be problems of background clutter, where maybe the foreground object, the cat, could actually look quite similar in appearance to the background. This is another thing that we need to handle. There's also the problem of intraclass variation: this one notion of cat-ness actually spans a lot of different visual appearances. Cats can come in different shapes and sizes and colors and ages, and our algorithm, again, needs to work and handle all of these different variations. So, this is actually a really, really challenging problem.
And it's sort of easy to forget how hard this is, because so much of your brain is specifically tuned for dealing with these things. But if we want our computer programs to deal with all of these problems, all simultaneously, and not just for cats, by the way, but for just about any object category you can imagine, this is a fantastically challenging problem. And it's actually somewhat miraculous that this works at all, in my opinion. But not only does it work; these things work very close to human accuracy in some limited situations, and take only hundreds of milliseconds to do so. So, this is some pretty amazing, incredible technology, in my opinion, and over the course of the rest of the class we will really see what kinds of advancements have made this possible. So now, if you think about what the API for writing an image classifier is, you might sit down and try to write a method in Python like this, where you want to take in an image, do some crazy magic, and eventually spit out a class label to say cat or dog or whatnot.
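The function stub being described might look like the following sketch. The name `classify_image` is a hypothetical stand-in; the point is that the signature is trivial to write, but there is no obvious algorithm to put in the body:

```python
def classify_image(image):
    # ??? What goes here? Unlike sorting or a convex hull,
    # there is no explicit sequence of steps to enumerate.
    class_label = "cat"  # placeholder, not a real algorithm
    return class_label
```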
And there's really no obvious way to do this, right? If you're taking an algorithms class and your task is to sort numbers, or compute a convex hull, or even do something like RSA encryption, you can write down an algorithm and enumerate all the steps that need to happen in order for the thing to work. But when we're trying to recognize objects, like cats in images, there's no really clear, explicit algorithm that makes intuitive sense for how you might go about recognizing these objects. So, this is, again, quite challenging: if it was your first day programming and you had to sit down and write this function, I think most people would be in trouble. That being said, people have definitely made explicit attempts to try to write hand-coded rules for recognizing different animals. We touched on this a little bit in the last lecture, but maybe one idea for cats is that we know that cats have ears and eyes and mouths and noses.
And we know, from Hubel and Wiesel, that edges are pretty important when it comes to visual recognition. So one thing we might try to do is compute the edges of this image, then go in and try to categorize all the different corners and boundaries, and say that, if we have maybe three lines meeting this way, then it might be a corner, and an ear has one corner here and one corner there and one corner there, and then write down this explicit set of rules for recognizing cats. But this turns out not to work very well. One, it's super brittle. And two, if you want to start over for another object category, and maybe not worry about cats but talk about trucks or dogs or fishes or something else, then you need to start all over again. So, this is really not a very scalable approach. We want to come up with some algorithm, or some method, for these recognition tasks which scales much more naturally to all the variety of objects in the world. So, the insight that makes this all work is this idea of the data-driven approach.
Rather than sitting down and writing these hand-specified rules to try to craft exactly what is a cat or a fish or what have you, instead we'll go out onto the internet and collect a large dataset of many, many cats and many, many airplanes and many, many deer and different things like this. We can actually use tools like Google Image Search, or something like that, to go out and collect a very large number of examples of these different categories. By the way, it actually takes quite a lot of effort to go out and collect these datasets, but, luckily, there are a lot of really good, high quality datasets out there already for you to use. Then, once we get this dataset, we train a machine learning classifier that is going to ingest all of the data, summarize it in some way, and then spit out a model that summarizes the knowledge of how to recognize these different object categories. Then, finally, we'll take this trained model and apply it on new images, where it will then be able to recognize cats and dogs and whatnot.
So here our API has changed a little bit. Rather than a single function that just inputs an image and recognizes a cat, we have these two functions. One, called train, is going to input images and labels and then output a model; and then, separately, another function, called predict, will input the model and then make predictions for images. And this is kind of the key insight that allowed all these things to start working really well over the last 10 to 20 years or so. So, this class is primarily about neural networks and convolutional neural networks and deep learning and all that, but this idea of a data-driven approach is much more general than just deep learning. And I think it's useful to step through this process for a very simple classifier first, before we get to these big, complex ones. So, probably the simplest classifier you can imagine is something we call nearest neighbor. The algorithm is pretty dumb, honestly. During the training step we won't do anything; we'll just memorize all of the training data.
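Backing up to the two-function API just described, here is a minimal sketch of the train/predict shape. The function names follow the lecture; the "model" itself is a deliberately silly assumption for illustration (it just memorizes the most common label), standing in for whatever a real classifier would learn:

```python
from collections import Counter

def train(images, labels):
    # Ingest the data and summarize it into a model.
    # Toy model: just remember the most common training label.
    model = {"majority_label": Counter(labels).most_common(1)[0][0]}
    return model

def predict(model, test_images):
    # Apply the trained model to new images.
    return [model["majority_label"] for _ in test_images]

model = train(["img1", "img2", "img3"], ["cat", "cat", "dog"])
print(predict(model, ["new_img"]))  # -> ['cat']
```

The important part is the separation: all the learning happens in `train`, and `predict` only consumes the resulting model.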
So this is very simple. And now, during the prediction step, we're going to take some new image, try to find the most similar image in the training data to that new image, and then predict the label of that most similar image. A very simple algorithm, but it has a lot of these nice properties with respect to data-drivenness and whatnot. So, to be a little bit more concrete, you might imagine working on this dataset called CIFAR-10, which is very commonly used in machine learning as kind of a small test case, and you'll be working with this dataset on your homework. The CIFAR-10 dataset gives you 10 different classes: airplanes and automobiles and birds and cats and different things like that. Across those 10 categories it provides 50,000 training images, roughly evenly distributed across the 10 categories, and then 10,000 additional testing images that you're supposed to test your algorithm on. So here's an example of applying this simple nearest neighbor classifier to some of these test images on CIFAR-10.
On this grid on the right, the leftmost column gives a test image from the CIFAR-10 dataset, and then to the right of each one we've sorted the training images and shown the most similar training images to each of these test examples. You can see that they look kind of visually similar to the test images, although they are not always correct. On the second row, maybe, we see that... well, this is kind of hard to see, because these images are 32 by 32 pixels, so you need to really dive in there and try to make your best guess. But this test image is a dog, and its nearest neighbor is also a dog, while the next one, I think, is actually a deer or a horse or something else. You can see that it looks quite visually similar, though, because there's kind of a white blob in the middle and whatnot. So, if we're applying the nearest neighbor algorithm to this image, we'll find the closest example in the training set, and the closest example, we know its label, because it comes from the training set. So now we'll simply say that this testing image is also a dog.
You can see from these examples that this is probably not going to work very well, but it's still kind of a nice example to work through. But then, one detail that we need to pin down is: given a pair of images, how can we actually compare them? Because if we're going to take our test image and compare it to all the training images, we actually have many different choices for exactly what that comparison function should look like. In the example on the previous slide, we've used what's called the L1 distance, also sometimes called the Manhattan distance. This is a really simple, easy idea for comparing images: we're going to just compare individual pixels in these images. So, supposing that our test image is maybe just a tiny four by four image of pixel values, we'll take the upper-left-hand pixel of the test image, subtract off the value in the training image, take the absolute value, and get the difference in that pixel between the two images. And then we sum all these up across all the pixels in the image.
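The pixelwise procedure just described can be sketched concretely in NumPy. The two 4x4 grids below are illustrative pixel values consistent with the lecture's example, chosen so that their L1 difference comes out to 456:

```python
import numpy as np

# Two tiny 4x4 "images" of pixel values (illustrative numbers).
test_image = np.array([[56, 32, 10,  18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [ 2,  0, 255, 220]])
train_image = np.array([[10, 20,  24,  17],
                        [ 8, 10,  89, 100],
                        [12, 16, 178, 170],
                        [ 4, 32, 233, 112]])

# L1 (Manhattan) distance: subtract pixelwise, take absolute values,
# then sum the differences over all pixels.
l1_distance = np.sum(np.abs(test_image - train_image))
print(l1_distance)  # -> 456
```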
So, this is kind of a stupid way to compare images, but it does some reasonable things sometimes. It gives us a very concrete way to measure the difference between two images, and in this case we have a difference of 456 between these two images. So, here's some full Python code for implementing this nearest neighbor classifier, and you can see it's pretty short and pretty concise, because we've made use of many of these vectorized operations offered by NumPy. Here we can see that the training function that we talked about earlier is, again, very simple: in the case of nearest neighbor, you just memorize the training data; there's not really much to do here. And now, at test time, we're going to take in our image, compare it, using this L1 distance function, to each of the training examples, and find the most similar example in the training set. And you can see that we're actually able to do this in just one or two lines of Python code by utilizing these vectorized operations in NumPy. So, this is something that you'll get practice with on the first assignment.
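A hedged reconstruction of the short implementation being described (the actual slide and assignment code may differ in details, but the structure is the one discussed: a trivial `train` that memorizes the data, and a `predict` that does a vectorized L1 comparison against every training example):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" just memorizes the data.
        # X is N x D, with each row a flattened image; y holds N labels.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # For each test example, find the closest training example
        # under L1 distance and return that example's label.
        num_test = X.shape[0]
        y_pred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # One vectorized line: L1 distance to ALL training rows at once.
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred
```

Note how broadcasting (`self.Xtr - X[i, :]`) plus `np.sum(..., axis=1)` replaces what would otherwise be a double loop over training examples and pixels.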
448 00:17:58,628 --> 00:18:02,179 So now, a couple questions about this simple classifier. 449 00:18:02,179 --> 00:18:04,910 First, if we have N examples in our training set, 450 00:18:04,910 --> 00:18:09,077 then how fast can we expect training and testing to be? 451 00:18:12,233 --> 00:18:13,998 Well, training is probably constant 452 00:18:13,998 --> 00:18:15,712 because we don't really need to do anything, 453 00:18:15,712 --> 00:18:17,878 we just need to memorize the data. 454 00:18:17,878 --> 00:18:19,241 And if you're just copying a pointer, 455 00:18:19,241 --> 00:18:20,663 that's going to be constant time 456 00:18:20,663 --> 00:18:22,703 no matter how big your dataset is. 457 00:18:22,703 --> 00:18:26,209 But now, at test time we need to do this comparison step 458 00:18:26,209 --> 00:18:27,679 and compare our test image 459 00:18:27,679 --> 00:18:31,099 to each of the N training examples in the dataset. 460 00:18:31,099 --> 00:18:33,766 And this is actually quite slow. 461 00:18:34,991 --> 00:18:37,448 So, this is actually somewhat backwards, 462 00:18:37,448 --> 00:18:38,641 if you think about it. 463 00:18:38,641 --> 00:18:40,350 Because, in practice, 464 00:18:40,350 --> 00:18:43,489 we want our classifiers to be slow at training time 465 00:18:43,489 --> 00:18:45,326 and then fast at testing time. 466 00:18:45,326 --> 00:18:47,858 Because, you might imagine, that a classifier might go 467 00:18:47,858 --> 00:18:49,882 and be trained in a data center somewhere 468 00:18:49,882 --> 00:18:52,045 and you can afford to spend a lot of computation 469 00:18:52,045 --> 00:18:54,640 at training time to make the classifier really good.
470 00:18:54,640 --> 00:18:55,473 But then, 471 00:18:55,473 --> 00:18:57,566 when you go and deploy the classifier at test time, 472 00:18:57,566 --> 00:18:59,503 you want it to run on your mobile phone 473 00:18:59,503 --> 00:19:02,248 or in a browser or some other low power device, 474 00:19:02,248 --> 00:19:04,493 and you really want the testing time performance 475 00:19:04,493 --> 00:19:07,075 of your classifier to be quite fast. 476 00:19:07,075 --> 00:19:09,929 So, from this perspective, this nearest neighbor algorithm, 477 00:19:09,929 --> 00:19:11,826 is, actually, a little bit backwards. 478 00:19:11,826 --> 00:19:13,591 And we'll see that once we move to 479 00:19:13,591 --> 00:19:14,720 convolutional neural networks, 480 00:19:14,720 --> 00:19:16,420 and other types of parametric models, 481 00:19:16,420 --> 00:19:18,286 they'll be the reverse of this. 482 00:19:18,286 --> 00:19:20,226 Where you'll spend a lot of compute at training time, 483 00:19:20,226 --> 00:19:24,936 but then they'll be quite fast at testing time. 484 00:19:24,936 --> 00:19:25,969 So then, the question is, 485 00:19:25,969 --> 00:19:28,139 what exactly does this nearest neighbor algorithm 486 00:19:28,139 --> 00:19:30,816 look like when you apply it in practice? 487 00:19:30,816 --> 00:19:34,049 So, here we've drawn, what we call the decision regions 488 00:19:34,049 --> 00:19:36,130 of a nearest neighbor classifier. 489 00:19:36,130 --> 00:19:39,636 So, here our training set consists of these points 490 00:19:39,636 --> 00:19:42,021 in the two dimensional plane, 491 00:19:42,021 --> 00:19:45,388 where the color of the point represents the category, 492 00:19:45,388 --> 00:19:47,547 or the class label, of that point. 493 00:19:47,547 --> 00:19:49,221 So, here we see we have five classes 494 00:19:49,221 --> 00:19:51,543 and some blue ones up in the corner here, 495 00:19:51,543 --> 00:19:53,921 some purple ones in the upper-right hand corner. 
496 00:19:53,921 --> 00:19:56,491 And now for each pixel in this entire plane, 497 00:19:56,491 --> 00:20:00,860 we've gone and computed what is the nearest example 498 00:20:00,860 --> 00:20:02,560 in the training data, 499 00:20:02,560 --> 00:20:04,391 and then colored the point of the background 500 00:20:04,391 --> 00:20:06,954 corresponding to what is the class label. 501 00:20:06,954 --> 00:20:09,049 So, you can see that this nearest neighbor classifier 502 00:20:09,049 --> 00:20:11,032 is just sort of carving up the space 503 00:20:11,032 --> 00:20:14,979 and coloring the space according to the nearby points. 504 00:20:14,979 --> 00:20:18,320 But this classifier is maybe not so great. 505 00:20:18,320 --> 00:20:20,002 And by looking at this picture 506 00:20:20,002 --> 00:20:22,568 we can start to see some of the problems that might come out 507 00:20:22,568 --> 00:20:24,676 with a nearest neighbor classifier. 508 00:20:24,676 --> 00:20:27,487 For one, this central region actually contains 509 00:20:27,487 --> 00:20:29,208 mostly green points, 510 00:20:29,208 --> 00:20:31,591 but one little yellow point in the middle. 511 00:20:31,591 --> 00:20:34,406 But because we're just looking at the nearest neighbor, 512 00:20:34,406 --> 00:20:36,411 this causes a little yellow island to appear 513 00:20:36,411 --> 00:20:38,552 in the middle of this green cluster. 514 00:20:38,552 --> 00:20:40,087 And that's, maybe, not so great. 515 00:20:40,087 --> 00:20:44,081 Maybe those points actually should have been green. 516 00:20:44,081 --> 00:20:47,990 And then, similarly we also see these, sort of, fingers, 517 00:20:47,990 --> 00:20:50,225 like the green region pushing into the blue region, 518 00:20:50,225 --> 00:20:52,450 again, due to the presence of one point, 519 00:20:52,450 --> 00:20:55,180 which may have been noisy or spurious.
520 00:20:55,180 --> 00:20:58,205 So, this kind of motivates a slight generalization 521 00:20:58,205 --> 00:21:01,606 of this algorithm called k-nearest neighbors. 522 00:21:01,606 --> 00:21:05,070 So rather than just looking for the single nearest neighbor, 523 00:21:05,070 --> 00:21:08,021 instead we'll do something a little bit fancier 524 00:21:08,021 --> 00:21:10,627 and find K of our nearest neighbors, 525 00:21:10,627 --> 00:21:12,240 according to our distance metric, 526 00:21:12,240 --> 00:21:15,057 and then take a vote among each of our neighbors. 527 00:21:15,057 --> 00:21:16,708 And then predict the majority vote 528 00:21:16,708 --> 00:21:18,733 among our neighbors. 529 00:21:18,733 --> 00:21:21,032 You can imagine slightly more complex ways of doing this. 530 00:21:21,032 --> 00:21:22,919 Maybe you'd vote weighted on the distance, 531 00:21:22,919 --> 00:21:24,148 or something like that, 532 00:21:24,148 --> 00:21:27,912 but the simplest thing that tends to work pretty well 533 00:21:27,912 --> 00:21:29,810 is just taking a majority vote. 534 00:21:29,810 --> 00:21:32,810 So here we've shown the exact same set of points 535 00:21:32,810 --> 00:21:35,899 using this K=1 nearest neighbor classifier, 536 00:21:35,899 --> 00:21:39,928 as well as K=3 and K=5 in the middle and on the right. 537 00:21:39,928 --> 00:21:43,792 And once we move to K=3, you can see that that spurious 538 00:21:43,792 --> 00:21:46,186 yellow point in the middle of the green cluster 539 00:21:46,186 --> 00:21:49,369 is no longer causing the points near that region 540 00:21:49,369 --> 00:21:50,966 to be classified as yellow. 541 00:21:50,966 --> 00:21:53,684 Now this entire green portion in the middle 542 00:21:53,684 --> 00:21:55,852 is all being classified as green. 
543 00:21:55,852 --> 00:21:57,737 You can also see that these fingers 544 00:21:57,737 --> 00:21:59,272 of the red and blue regions 545 00:21:59,272 --> 00:22:00,823 are starting to get smoothed out 546 00:22:00,823 --> 00:22:02,454 due to this majority voting. 547 00:22:02,454 --> 00:22:05,381 And then, once we move to the K=5 case, 548 00:22:05,381 --> 00:22:06,900 then these decision boundaries 549 00:22:06,900 --> 00:22:08,437 between the blue and red regions 550 00:22:08,437 --> 00:22:12,064 have become quite smooth and quite nice. 551 00:22:12,064 --> 00:22:13,818 So, generally when you're using nearest neighbors 552 00:22:13,818 --> 00:22:14,727 classifiers, 553 00:22:14,727 --> 00:22:18,718 you almost always want to use some value of K, 554 00:22:18,718 --> 00:22:20,771 which is larger than one 555 00:22:20,771 --> 00:22:22,897 because this tends to smooth out your decision 556 00:22:22,897 --> 00:22:26,064 boundaries and lead to better results. 557 00:22:29,252 --> 00:22:30,159 Question? 558 00:22:30,159 --> 00:22:34,279 [student asking a question] 559 00:22:34,279 --> 00:22:35,208 Yes, so the question is, 560 00:22:35,208 --> 00:22:38,133 what is the deal with these white regions? 561 00:22:38,133 --> 00:22:41,025 The white regions are where there was no majority 562 00:22:41,025 --> 00:22:43,247 among the k-nearest neighbors. 563 00:22:43,247 --> 00:22:45,521 You could imagine maybe doing something slightly fancier 564 00:22:45,521 --> 00:22:48,972 and maybe taking a guess or randomly selecting among 565 00:22:48,972 --> 00:22:50,472 the majority winners, 566 00:22:50,472 --> 00:22:52,729 but for this simple example we're just coloring it white 567 00:22:52,729 --> 00:22:54,335 to indicate there was no nearest neighbor 568 00:22:54,335 --> 00:22:55,668 in those points. 
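The voting step just described can be sketched like this: a toy version assuming L1 distance and a simple majority, with no special tie-breaking (the demo just colored ties white).

```python
import numpy as np
from collections import Counter

def knn_predict_one(Xtr, ytr, x, k=3):
    """Predict a label for one point by majority vote among its
    k nearest training points under L1 distance."""
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest_idx = np.argsort(distances)[:k]      # indices of the k closest
    votes = Counter(ytr[nearest_idx].tolist())
    label, count = votes.most_common(1)[0]       # on a tie, Counter picks one
    return label
```

With `k=1` this reduces exactly to the nearest neighbor classifier from before; larger `k` is what smooths out the spurious islands and fingers in the decision regions.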
569 00:23:00,005 --> 00:23:02,439 Whenever we're thinking about computer vision 570 00:23:02,439 --> 00:23:04,225 I think it's really useful to kind of flip 571 00:23:04,225 --> 00:23:06,616 back and forth between several different viewpoints. 572 00:23:06,616 --> 00:23:09,480 One, is this idea of high dimensional points in the plane, 573 00:23:09,480 --> 00:23:13,049 and then the other is actually looking at concrete images. 574 00:23:13,049 --> 00:23:15,118 Because the pixels of the image actually 575 00:23:15,118 --> 00:23:18,185 allow us to think of these images as high dimensional 576 00:23:18,185 --> 00:23:19,048 vectors. 577 00:23:19,048 --> 00:23:21,181 And it's sort of useful to ping pong back and forth 578 00:23:21,181 --> 00:23:23,395 between these two different viewpoints. 579 00:23:23,395 --> 00:23:26,193 So then, sort of taking this k-nearest neighbor 580 00:23:26,193 --> 00:23:27,876 and going back to the images 581 00:23:27,876 --> 00:23:29,443 you can see that it's actually not very good. 582 00:23:29,443 --> 00:23:31,216 Here I've colored in red and green 583 00:23:31,216 --> 00:23:33,265 which images would actually be classified correctly 584 00:23:33,265 --> 00:23:35,385 or incorrectly according to their nearest neighbor. 585 00:23:35,385 --> 00:23:38,288 And you can see that it's really not very good. 586 00:23:38,288 --> 00:23:41,459 But maybe if we used a larger value of K 587 00:23:41,459 --> 00:23:43,690 then this would involve actually voting among 588 00:23:43,690 --> 00:23:45,586 maybe the top three or the top five 589 00:23:45,586 --> 00:23:47,264 or maybe even the whole row. 590 00:23:47,264 --> 00:23:48,806 And you could imagine that that would end up being 591 00:23:48,806 --> 00:23:52,158 a lot more robust to some of this noise that we see 592 00:23:52,158 --> 00:23:55,325 when retrieving neighbors in this way. 
593 00:23:57,070 --> 00:23:59,803 So another choice we have when we're working 594 00:23:59,803 --> 00:24:01,666 with the k-nearest neighbor algorithm 595 00:24:01,666 --> 00:24:04,187 is determining exactly how we should be comparing 596 00:24:04,187 --> 00:24:05,727 our different points. 597 00:24:05,727 --> 00:24:09,624 For the examples so far we've just shown 598 00:24:09,624 --> 00:24:11,666 we've talked about this L1 distance 599 00:24:11,666 --> 00:24:13,522 which takes the sum of the absolute values 600 00:24:13,522 --> 00:24:15,126 between the pixels. 601 00:24:15,126 --> 00:24:18,299 But another common choice is the L2 or Euclidean distance 602 00:24:18,299 --> 00:24:21,266 where you take the square root of the sum of the squares 603 00:24:21,266 --> 00:24:24,491 and take this as your distance. 604 00:24:24,491 --> 00:24:26,588 Choosing different distance metrics actually 605 00:24:26,588 --> 00:24:28,727 is a pretty interesting topic 606 00:24:28,727 --> 00:24:30,072 because different distance metrics 607 00:24:30,072 --> 00:24:32,266 make different assumptions about the underlying 608 00:24:32,266 --> 00:24:35,386 geometry or topology that you'd expect in the space. 609 00:24:35,386 --> 00:24:39,103 So, this L1 distance, underneath this, this is actually 610 00:24:39,103 --> 00:24:41,688 a circle according to the L1 distance 611 00:24:41,688 --> 00:24:43,487 and it forms this square shape thing 612 00:24:43,487 --> 00:24:45,020 around the origin. 613 00:24:45,020 --> 00:24:47,637 Where each of the points on this, on the square, 614 00:24:47,637 --> 00:24:51,017 is equidistant from the origin according to L1, 615 00:24:51,017 --> 00:24:53,116 whereas with the L2 or Euclidean distance 616 00:24:53,116 --> 00:24:55,290 then this circle is a familiar circle, 617 00:24:55,290 --> 00:24:57,241 it looks like what you'd expect. 
618 00:24:57,241 --> 00:24:59,521 So one interesting thing to point out between these two 619 00:24:59,521 --> 00:25:00,798 metrics in particular, 620 00:25:00,798 --> 00:25:03,672 is that the L1 distance depends on your choice 621 00:25:03,672 --> 00:25:05,135 of coordinate system. 622 00:25:05,135 --> 00:25:07,306 So if you were to rotate the coordinate frame 623 00:25:07,306 --> 00:25:08,902 that would actually change the L1 distance 624 00:25:08,902 --> 00:25:10,201 between the points. 625 00:25:10,201 --> 00:25:13,287 Whereas changing the coordinate frame in the L2 distance 626 00:25:13,287 --> 00:25:15,491 doesn't matter, it's the same thing no matter what 627 00:25:15,491 --> 00:25:18,281 your coordinate frame is. 628 00:25:18,281 --> 00:25:21,721 Maybe if your input features, if the individual entries 629 00:25:21,721 --> 00:25:23,866 in your vector have some important meaning 630 00:25:23,866 --> 00:25:24,791 for your task, 631 00:25:24,791 --> 00:25:27,935 then maybe somehow L1 might be a more natural fit. 632 00:25:27,935 --> 00:25:30,533 But if it's just a generic vector in some space 633 00:25:30,533 --> 00:25:32,373 and you don't know which of the different elements, 634 00:25:32,373 --> 00:25:34,121 you don't know what they actually mean, 635 00:25:34,121 --> 00:25:37,531 then maybe L2 is slightly more natural. 636 00:25:37,531 --> 00:25:40,154 And another point here is that 637 00:25:40,154 --> 00:25:41,839 by using different distance metrics 638 00:25:41,839 --> 00:25:43,474 we can actually generalize the k-nearest neighbor 639 00:25:43,474 --> 00:25:46,343 classifier to many, many different types of data, 640 00:25:46,343 --> 00:25:48,280 not just vectors, not just images.
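Both metrics, and the coordinate-dependence point, are easy to check numerically; a small sketch:

```python
import numpy as np

def l1(x, y):
    return np.sum(np.abs(x - y))          # Manhattan distance

def l2(x, y):
    return np.sqrt(np.sum((x - y) ** 2))  # Euclidean distance

# Rotate the coordinate frame by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])

# L2 is unchanged by the rotation; L1 grows from 1 to sqrt(2).
print(l2(a, b), l2(R @ a, R @ b))  # 1.0 and 1.0
print(l1(a, b), l1(R @ a, R @ b))  # 1.0 and ~1.414
```
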
641 00:25:48,280 --> 00:25:51,410 So, for example, imagine you wanted to classify pieces 642 00:25:51,410 --> 00:25:54,073 of text, then the only thing you need to do 643 00:25:54,073 --> 00:25:55,366 to use k-nearest neighbors 644 00:25:55,366 --> 00:25:57,716 is to specify some distance function 645 00:25:57,716 --> 00:26:01,692 that can measure distances between maybe two paragraphs 646 00:26:01,692 --> 00:26:03,831 or two sentences or something like that. 647 00:26:03,831 --> 00:26:06,908 So, simply by specifying different distance metrics 648 00:26:06,908 --> 00:26:09,377 we can actually apply this algorithm very generally 649 00:26:09,377 --> 00:26:12,701 to basically any type of data. 650 00:26:12,701 --> 00:26:14,598 Even though it's a kind of simple algorithm, 651 00:26:14,598 --> 00:26:17,200 in general, it's a very good thing to try first 652 00:26:17,200 --> 00:26:20,283 when you're looking at a new problem. 653 00:26:21,805 --> 00:26:23,986 So then, it's also kind of interesting to think about 654 00:26:23,986 --> 00:26:25,854 what is actually happening geometrically 655 00:26:25,854 --> 00:26:28,441 if we choose different distance metrics. 656 00:26:28,441 --> 00:26:31,370 So here we see the same set of points on the left 657 00:26:31,370 --> 00:26:33,577 using the L1, or Manhattan distance, 658 00:26:33,577 --> 00:26:36,767 and then, on the right, using the familiar L2, 659 00:26:36,767 --> 00:26:38,087 or Euclidean distance. 660 00:26:38,087 --> 00:26:40,298 And you can see that the shapes of these decision 661 00:26:40,298 --> 00:26:42,476 boundaries actually change quite a bit 662 00:26:42,476 --> 00:26:44,093 between the two metrics. 663 00:26:44,093 --> 00:26:46,792 So when you're looking at L1 these decision boundaries 664 00:26:46,792 --> 00:26:49,337 tend to follow the coordinate axes. 665 00:26:49,337 --> 00:26:52,168 And this is again because the L1 depends on our choice 666 00:26:52,168 --> 00:26:53,451 of coordinate system. 
667 00:26:53,451 --> 00:26:56,020 Where the L2 sort of doesn't really care about the 668 00:26:56,020 --> 00:26:57,544 coordinate axis, it just puts the boundaries 669 00:26:57,544 --> 00:27:00,294 where they should fall naturally. 670 00:27:04,161 --> 00:27:06,406 My confession is that each of these examples 671 00:27:06,406 --> 00:27:08,669 that I've shown you is actually from this interactive 672 00:27:08,669 --> 00:27:10,293 web demo that I built, 673 00:27:10,293 --> 00:27:12,918 where you can go and play with this k-nearest neighbor 674 00:27:12,918 --> 00:27:14,618 classifier on your own. 675 00:27:14,618 --> 00:27:17,938 And this is really hard to work on a projector screen. 676 00:27:17,938 --> 00:27:21,271 So maybe we'll do that on your own time. 677 00:27:26,820 --> 00:27:29,403 So, let's just go back to here. 678 00:27:32,951 --> 00:27:35,784 Man, this is kind of embarrassing. 679 00:28:07,103 --> 00:28:09,679 Okay, that was way more trouble than it was worth. 680 00:28:09,679 --> 00:28:12,319 So, let's skip this, but I encourage you 681 00:28:12,319 --> 00:28:13,496 to go play with this in your browser. 682 00:28:13,496 --> 00:28:15,541 It's actually pretty fun 683 00:28:15,541 --> 00:28:17,742 and kind of nice to build intuition about 684 00:28:17,742 --> 00:28:19,246 how the decision boundary changes 685 00:28:19,246 --> 00:28:20,536 as you change the K 686 00:28:20,536 --> 00:28:23,529 and change your distance metric 687 00:28:23,529 --> 00:28:26,029 and all those sorts of things. 688 00:28:30,641 --> 00:28:32,387 Okay, so then the question is 689 00:28:32,387 --> 00:28:34,072 once you're actually trying to use this algorithm 690 00:28:34,072 --> 00:28:36,146 in practice, there's several choices 691 00:28:36,146 --> 00:28:37,178 you need to make. 692 00:28:37,178 --> 00:28:39,426 We talked about choosing different values of K. 693 00:28:39,426 --> 00:28:41,520 We talked about choosing different distance metrics. 
694 00:28:41,520 --> 00:28:43,403 And the question becomes 695 00:28:43,403 --> 00:28:45,584 how do you actually make these choices for your problem 696 00:28:45,584 --> 00:28:47,286 and for your data? 697 00:28:47,286 --> 00:28:51,736 So, these choices, of things like K and the distance metric, 698 00:28:51,736 --> 00:28:53,937 we call hyperparameters, 699 00:28:53,937 --> 00:28:56,456 because they are not necessarily learned from the training 700 00:28:56,456 --> 00:28:57,289 data, 701 00:28:57,289 --> 00:29:00,267 instead these are choices about your algorithm that you make 702 00:29:00,267 --> 00:29:01,313 ahead of time 703 00:29:01,313 --> 00:29:05,623 and there's no way to learn them directly from the data. 704 00:29:05,623 --> 00:29:08,821 So, the question is how do you set these things 705 00:29:08,821 --> 00:29:10,260 in practice? 706 00:29:10,260 --> 00:29:12,277 And they turn out to be very problem-dependent. 707 00:29:12,277 --> 00:29:15,309 And the simple thing that most people do is simply 708 00:29:15,309 --> 00:29:17,957 try different values of hyperparameters for your data 709 00:29:17,957 --> 00:29:20,950 and for your problem, and figure out which one works best. 710 00:29:20,950 --> 00:29:22,404 There's a question? 711 00:29:22,404 --> 00:29:26,071 [student asking a question] 712 00:29:29,589 --> 00:29:32,367 So, the question is, where L1 distance might be preferable 713 00:29:32,367 --> 00:29:34,447 to using L2 distance? 714 00:29:34,447 --> 00:29:36,800 I think it's mainly problem-dependent, 715 00:29:36,800 --> 00:29:38,106 it's sort of difficult to say 716 00:29:38,106 --> 00:29:40,371 in which cases you think one might be better 717 00:29:40,371 --> 00:29:41,204 than the other. 
718 00:29:41,204 --> 00:29:44,719 But I think that because L1 has this sort of coordinate 719 00:29:44,719 --> 00:29:48,877 dependency, it actually depends on the coordinate system 720 00:29:48,877 --> 00:29:50,185 of your data, 721 00:29:50,185 --> 00:29:52,833 if you know that you have a vector, 722 00:29:52,833 --> 00:29:54,482 and maybe the individual elements of the vector 723 00:29:54,482 --> 00:29:55,513 have meaning. 724 00:29:55,513 --> 00:29:58,583 Like maybe you're classifying employees for some reason 725 00:29:58,583 --> 00:30:00,713 and then the different elements of that vector correspond 726 00:30:00,713 --> 00:30:03,976 to different features or aspects of an employee. 727 00:30:03,976 --> 00:30:06,432 Like their salary or the number of years they've been 728 00:30:06,432 --> 00:30:08,778 working at the company or something like that. 729 00:30:08,778 --> 00:30:10,949 So I think when your individual elements actually 730 00:30:10,949 --> 00:30:11,860 have some meaning, 731 00:30:11,860 --> 00:30:14,297 is where I think maybe using L1 might make a little bit 732 00:30:14,297 --> 00:30:15,850 more sense. 733 00:30:15,850 --> 00:30:17,623 But in general, again, this is a hyperparameter 734 00:30:17,623 --> 00:30:19,989 and it really depends on your problem and your data 735 00:30:19,989 --> 00:30:22,214 so the best answer is just to try them both 736 00:30:22,214 --> 00:30:24,381 and see what works better. 737 00:30:28,381 --> 00:30:30,092 Even this idea of trying out different values 738 00:30:30,092 --> 00:30:32,413 of hyperparameters and seeing what works best, 739 00:30:32,413 --> 00:30:34,238 there are many different choices here. 740 00:30:34,238 --> 00:30:36,287 What exactly does it mean to try hyperparameters 741 00:30:36,287 --> 00:30:38,268 and see what works best?
742 00:30:38,268 --> 00:30:40,391 Well, the first idea you might think of 743 00:30:40,391 --> 00:30:42,911 is simply choosing the hyperparameters that give you 744 00:30:42,911 --> 00:30:45,032 the best accuracy or best performance 745 00:30:45,032 --> 00:30:47,691 on your training data. 746 00:30:47,691 --> 00:30:49,961 This is actually a really terrible idea. 747 00:30:49,961 --> 00:30:52,137 You should never do this. 748 00:30:52,137 --> 00:30:54,081 In the concrete case of the nearest neighbor 749 00:30:54,081 --> 00:30:55,717 classifier, for example, 750 00:30:55,717 --> 00:30:59,324 if we set K=1, we will always classify the training data 751 00:30:59,324 --> 00:31:00,157 perfectly. 752 00:31:01,124 --> 00:31:04,420 So if we use this strategy we'll always pick K=1, 753 00:31:04,420 --> 00:31:06,390 but, as we saw from the examples earlier, 754 00:31:06,390 --> 00:31:10,446 in practice it seems that setting K equals to larger values 755 00:31:10,446 --> 00:31:13,184 might cause us to misclassify some of the training data, 756 00:31:13,184 --> 00:31:14,713 but, in fact, lead to better performance 757 00:31:14,713 --> 00:31:17,915 on points that were not in the training data. 758 00:31:17,915 --> 00:31:19,208 And ultimately in machine learning 759 00:31:19,208 --> 00:31:21,434 we don't care about fitting the training data, 760 00:31:21,434 --> 00:31:23,317 we really care about how our classifier, 761 00:31:23,317 --> 00:31:24,463 or how our method, 762 00:31:24,463 --> 00:31:27,051 will perform on unseen data after training. 763 00:31:27,051 --> 00:31:30,495 So, this is a terrible idea, don't do this. 764 00:31:30,495 --> 00:31:32,687 So, another idea that you might think of, 765 00:31:32,687 --> 00:31:34,856 is maybe we'll take our full dataset 766 00:31:34,856 --> 00:31:36,532 and we'll split it into some training data 767 00:31:36,532 --> 00:31:38,390 and some test data. 
768 00:31:38,390 --> 00:31:42,294 And now I'll try training my algorithm with different 769 00:31:42,294 --> 00:31:44,409 choices of hyperparameters on the training data 770 00:31:44,409 --> 00:31:47,263 and then I'll go and apply that trained classifier 771 00:31:47,263 --> 00:31:49,584 on the test data and now I will pick 772 00:31:49,584 --> 00:31:52,299 the set of hyperparameters that cause me to perform best 773 00:31:52,299 --> 00:31:53,716 on the test data. 774 00:31:54,582 --> 00:31:57,414 This seems like maybe a more reasonable strategy, 775 00:31:57,414 --> 00:31:59,546 but, in fact, this is also a terrible idea 776 00:31:59,546 --> 00:32:01,087 and you should never do this. 777 00:32:01,087 --> 00:32:03,723 Because, again, the point of machine learning systems 778 00:32:03,723 --> 00:32:06,515 is that we want to know how our algorithm will perform. 779 00:32:06,515 --> 00:32:08,017 So, the point of the test set 780 00:32:08,017 --> 00:32:10,760 is to give us some estimate of how our method will do 781 00:32:10,760 --> 00:32:14,523 on unseen data that's coming out from the wild. 782 00:32:14,523 --> 00:32:17,710 And if we use this strategy of training many different 783 00:32:17,710 --> 00:32:19,749 algorithms with different hyperparameters, 784 00:32:19,749 --> 00:32:22,133 and then, selecting the one which does the best 785 00:32:22,133 --> 00:32:23,363 on the test data, 786 00:32:23,363 --> 00:32:26,404 then, it's possible, that we may have just picked 787 00:32:26,404 --> 00:32:28,045 the right set of hyperparameters 788 00:32:28,045 --> 00:32:30,319 that caused our algorithm to work quite well 789 00:32:30,319 --> 00:32:31,663 on this testing set, 790 00:32:31,663 --> 00:32:33,776 but now our performance on this test set 791 00:32:33,776 --> 00:32:35,488 will no longer be representative 792 00:32:35,488 --> 00:32:38,280 of our performance of new, unseen data. 
793 00:32:38,280 --> 00:32:41,491 So, again, you should not do this, this is a bad idea, 794 00:32:41,491 --> 00:32:44,672 you'll get in trouble if you do this. 795 00:32:44,672 --> 00:32:47,025 What is much more common, is to actually split your data 796 00:32:47,025 --> 00:32:49,192 into three different sets. 797 00:32:50,185 --> 00:32:54,165 You'll partition most of your data into a training set 798 00:32:54,165 --> 00:32:56,159 and then you'll create a validation set 799 00:32:56,159 --> 00:32:57,305 and a test set. 800 00:32:57,305 --> 00:33:01,465 And now what we typically do is go and train our algorithm 801 00:33:01,465 --> 00:33:03,500 with many different choices of hyperparameters 802 00:33:03,500 --> 00:33:05,172 on the training set, 803 00:33:05,172 --> 00:33:07,227 evaluate on the validation set, 804 00:33:07,227 --> 00:33:08,802 and now pick the set of hyperparameters 805 00:33:08,802 --> 00:33:11,621 which performs best on the validation set. 806 00:33:11,621 --> 00:33:13,719 And now, after you've done all your development, 807 00:33:13,719 --> 00:33:14,899 you've done all your debugging, 808 00:33:14,899 --> 00:33:16,627 after you've done everything, 809 00:33:16,627 --> 00:33:20,000 then you'd take that best performing classifier 810 00:33:20,000 --> 00:33:21,614 on the validation set 811 00:33:21,614 --> 00:33:23,401 and run it once on the test set. 812 00:33:23,401 --> 00:33:25,005 And now that's the number that goes into your paper, 813 00:33:25,005 --> 00:33:27,139 that's the number that goes into your report, 814 00:33:27,139 --> 00:33:29,542 that's the number that actually is telling you how 815 00:33:29,542 --> 00:33:32,393 your algorithm is doing on unseen data. 816 00:33:32,393 --> 00:33:34,335 And this is actually really, really important 817 00:33:34,335 --> 00:33:36,516 that you keep a very strict separation between 818 00:33:36,516 --> 00:33:38,847 the validation data and the test data.
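That protocol can be sketched as follows, with toy 1-D data and split sizes made up purely for illustration: hyperparameters are chosen on the validation set only, and the test set is touched exactly once.

```python
import numpy as np

def knn_accuracy(k, Xtr, ytr, Xev, yev):
    """Accuracy of k-nearest neighbors (L1 distance, majority vote)."""
    correct = 0
    for x, label in zip(Xev, yev):
        dists = np.sum(np.abs(Xtr - x), axis=1)
        nearest = ytr[np.argsort(dists)[:k]]
        correct += int(np.bincount(nearest).argmax() == label)
    return correct / len(yev)

# Toy data: class 0 clustered near 0, class 1 clustered near 10.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 1)), rng.normal(10, 1, (50, 1))])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# Random three-way split: most data for training.
perm = rng.permutation(len(X))
train, val, test = perm[:60], perm[60:80], perm[80:]

# Choose K using ONLY the validation set...
best_k = max([1, 3, 5],
             key=lambda k: knn_accuracy(k, X[train], y[train], X[val], y[val]))
# ...then run once on the held-out test set for the number you report.
test_acc = knn_accuracy(best_k, X[train], y[train], X[test], y[test])
```
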
819 00:33:38,847 --> 00:33:41,437 So, for example, when we're working on research papers, 820 00:33:41,437 --> 00:33:43,303 we typically only touch the test set 821 00:33:43,303 --> 00:33:45,653 at the very last minute. 822 00:33:45,653 --> 00:33:47,029 So, when I'm writing papers, 823 00:33:47,029 --> 00:33:49,472 I tend to only touch the test set for my problem 824 00:33:49,472 --> 00:33:51,809 in maybe the week before the deadline or so 825 00:33:51,809 --> 00:33:54,159 to really ensure that we're not 826 00:33:54,159 --> 00:33:56,665 being dishonest here and we're not reporting a number 827 00:33:56,665 --> 00:33:57,961 which is unfair. 828 00:33:57,961 --> 00:34:00,102 So, this is actually super important 829 00:34:00,102 --> 00:34:02,216 and you want to make sure to keep your test data 830 00:34:02,216 --> 00:34:03,883 quite under control. 831 00:34:06,468 --> 00:34:08,721 So another strategy for setting hyperparameters 832 00:34:08,721 --> 00:34:10,840 is called cross validation. 833 00:34:10,840 --> 00:34:13,868 And this is used a little bit more commonly 834 00:34:13,868 --> 00:34:17,317 for small data sets, not used so much in deep learning. 835 00:34:17,317 --> 00:34:20,201 So here the idea is we're going to take our test data, 836 00:34:20,201 --> 00:34:21,674 or we're going to take our dataset, 837 00:34:21,674 --> 00:34:25,534 as usual, hold out some test set to use at the very end, 838 00:34:25,534 --> 00:34:27,023 and now, for the rest of the data, 839 00:34:27,023 --> 00:34:29,188 rather than splitting it into a single training 840 00:34:29,188 --> 00:34:31,265 and validation partition, 841 00:34:31,266 --> 00:34:33,851 instead, we can split our training data 842 00:34:33,851 --> 00:34:35,515 into many different folds. 843 00:34:35,516 --> 00:34:38,879 And now, in this way, we've cycled through choosing which 844 00:34:38,879 --> 00:34:41,415 fold is going to be the validation set.
845 00:34:41,415 --> 00:34:43,286 So now, in this example, 846 00:34:43,286 --> 00:34:45,498 we're using five fold cross validation, 847 00:34:45,498 --> 00:34:47,594 so you would train your algorithm with one set of 848 00:34:47,594 --> 00:34:49,984 hyperparameters on the first four folds, 849 00:34:49,985 --> 00:34:51,928 evaluate the performance on fold five, 850 00:34:51,928 --> 00:34:54,177 and now go and retrain your algorithm on folds 851 00:34:54,177 --> 00:34:55,712 one, two, three, and five, 852 00:34:55,712 --> 00:34:57,769 evaluate on fold four, 853 00:34:57,769 --> 00:35:00,293 and cycle through all the different folds. 854 00:35:00,293 --> 00:35:01,765 And, when you do it this way, 855 00:35:01,765 --> 00:35:04,098 you get much higher confidence about 856 00:35:04,098 --> 00:35:06,215 which hyperparameters are going to perform 857 00:35:06,215 --> 00:35:07,511 more robustly. 858 00:35:07,511 --> 00:35:09,714 So this is kind of the gold standard to use, 859 00:35:09,714 --> 00:35:11,749 but, in practice in deep learning 860 00:35:11,749 --> 00:35:13,285 when we're training large models 861 00:35:13,285 --> 00:35:15,972 and training is very computationally expensive, 862 00:35:15,972 --> 00:35:18,652 this doesn't get used too much in practice. 863 00:35:18,652 --> 00:35:19,519 Question? 864 00:35:19,519 --> 00:35:23,186 [student asking a question] 865 00:35:29,515 --> 00:35:31,371 Yeah, so the question is, 866 00:35:31,371 --> 00:35:32,634 a little bit more concretely, 867 00:35:32,634 --> 00:35:34,320 what's the difference between the training and the 868 00:35:34,320 --> 00:35:35,728 validation set? 869 00:35:35,728 --> 00:35:39,895 So, if you think about the k-nearest neighbor classifier 870 00:35:41,060 --> 00:35:45,300 then the training set is this set of images with labels 871 00:35:45,300 --> 00:35:46,974 where we memorize the labels.
872 00:35:46,974 --> 00:35:48,607 And now, to classify an image, 873 00:35:48,607 --> 00:35:51,405 we're going to take the image and compare it to each element 874 00:35:51,405 --> 00:35:52,451 in the training data, 875 00:35:52,451 --> 00:35:56,618 and then transfer the label from the nearest training point. 876 00:36:00,052 --> 00:36:02,102 So now our algorithm will memorize everything 877 00:36:02,102 --> 00:36:03,184 in the training set, 878 00:36:03,184 --> 00:36:05,934 and now we'll take each element of the validation set 879 00:36:05,934 --> 00:36:08,062 and compare it to each element in the training data 880 00:36:08,062 --> 00:36:12,081 and then use this to determine what is the accuracy 881 00:36:12,081 --> 00:36:16,815 of our classifier when it's applied on the validation set. 882 00:36:16,815 --> 00:36:18,746 So this is the distinction between training 883 00:36:18,746 --> 00:36:19,919 and validation. 884 00:36:19,919 --> 00:36:22,364 Where your algorithm is able to see the labels 885 00:36:22,364 --> 00:36:24,207 of the training set, 886 00:36:24,207 --> 00:36:25,528 but for the validation set, 887 00:36:25,528 --> 00:36:28,473 your algorithm doesn't have direct access to the labels. 888 00:36:28,473 --> 00:36:30,818 We only use the labels of the validation set 889 00:36:30,818 --> 00:36:34,043 to check how well our algorithm is doing. 890 00:36:34,043 --> 00:36:34,907 A question? 891 00:36:34,907 --> 00:36:38,574 [student asking a question] 892 00:36:44,373 --> 00:36:47,736 The question is, whether the test set, 893 00:36:47,736 --> 00:36:49,442 is it possible that the test set might not be 894 00:36:49,442 --> 00:36:52,941 representative of data out there in the wild? 
895 00:36:52,941 --> 00:36:55,955 This definitely can be a problem in practice, 896 00:36:55,955 --> 00:36:58,252 the underlying statistical assumption here is that 897 00:36:58,252 --> 00:37:01,863 your data are all independently and identically distributed, 898 00:37:01,863 --> 00:37:05,280 so that all of your data points should be 899 00:37:06,631 --> 00:37:10,125 drawn from the same underlying probability distribution. 900 00:37:10,125 --> 00:37:12,948 Of course, in practice, this might not always be the case, 901 00:37:12,948 --> 00:37:14,768 and you definitely can run into cases 902 00:37:14,768 --> 00:37:18,798 where the test set might not be super representative 903 00:37:18,798 --> 00:37:20,951 of what you see in the wild. 904 00:37:20,951 --> 00:37:23,826 So this is kind of a problem that dataset creators and 905 00:37:23,826 --> 00:37:25,473 dataset curators need to think about. 906 00:37:25,473 --> 00:37:27,712 But when I'm creating datasets, for example, 907 00:37:27,712 --> 00:37:28,626 one thing I do, 908 00:37:28,626 --> 00:37:31,454 is I'll go and collect a whole bunch of data all at once, 909 00:37:31,454 --> 00:37:34,252 using the exact same methodology for collecting the data, 910 00:37:34,252 --> 00:37:36,656 and then afterwards you go and partition it randomly 911 00:37:36,656 --> 00:37:38,690 between train and test. 912 00:37:38,690 --> 00:37:40,962 One thing that can screw you up here is 913 00:37:40,962 --> 00:37:43,024 maybe if you're collecting data over time 914 00:37:43,024 --> 00:37:45,489 and you make the earlier data, that you collect first, 915 00:37:45,489 --> 00:37:46,343 be the training data, 916 00:37:46,343 --> 00:37:48,601 and the later data that you collect be the test data, 917 00:37:48,601 --> 00:37:50,300 then you actually might run into this shift 918 00:37:50,300 --> 00:37:51,613 that could cause problems. 
919 00:37:51,613 --> 00:37:53,971 But as long as this partition is random 920 00:37:53,971 --> 00:37:55,831 among your entire set of data points, 921 00:37:55,831 --> 00:37:58,762 then that's how we try to alleviate this problem 922 00:37:58,762 --> 00:37:59,762 in practice. 923 00:38:04,297 --> 00:38:06,619 So then, once you've gone through this 924 00:38:06,619 --> 00:38:08,351 cross validation procedure, 925 00:38:08,351 --> 00:38:11,493 then you end up with graphs that look something like this. 926 00:38:11,493 --> 00:38:14,856 So here, on the X axis, we are showing the value of K 927 00:38:14,856 --> 00:38:17,354 for a k-nearest neighbor classifier on some problem, 928 00:38:17,354 --> 00:38:21,565 and now on the Y axis, we are showing what is the accuracy 929 00:38:21,565 --> 00:38:25,102 of our classifier on some dataset 930 00:38:25,102 --> 00:38:26,961 for different values of K. 931 00:38:26,961 --> 00:38:28,332 And you can see that, in this case, 932 00:38:28,332 --> 00:38:31,705 we've done five fold cross validation over the data, 933 00:38:31,705 --> 00:38:34,639 so, for each value of K we have five different examples 934 00:38:34,639 --> 00:38:38,346 of how well this algorithm is doing. 935 00:38:38,346 --> 00:38:41,050 And, actually, going back to the question about 936 00:38:41,050 --> 00:38:43,165 having some test sets that are better or worse 937 00:38:43,165 --> 00:38:44,748 for your algorithm, 938 00:38:46,239 --> 00:38:47,683 using K fold cross validation 939 00:38:47,683 --> 00:38:50,276 is maybe one way to help quantify that a little bit. 940 00:38:50,276 --> 00:38:54,935 And, in that, we can see the variance of how this algorithm 941 00:38:54,935 --> 00:38:57,606 performs on the different validation folds. 942 00:38:57,606 --> 00:38:58,981 And that gives you some sense of, 943 00:38:58,981 --> 00:39:00,359 not just what is the best, 944 00:39:00,359 --> 00:39:04,301 but, also, what is the distribution of that performance. 
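The collect-everything-then-shuffle strategy mentioned a moment ago comes down to a random permutation of indices; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def random_split(X, y, test_fraction=0.2, seed=0):
    # Shuffle the indices before partitioning, so train and test are drawn
    # from the same distribution even if the data were collected over time.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```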
945 00:39:04,301 --> 00:39:06,442 So, whenever you're training machine learning models 946 00:39:06,442 --> 00:39:08,151 you end up making plots like this, 947 00:39:08,151 --> 00:39:09,744 where they show you what is your accuracy, 948 00:39:09,744 --> 00:39:12,380 or your performance as a function of your hyperparameters, 949 00:39:12,380 --> 00:39:14,857 and then you want to go and pick the model, 950 00:39:14,857 --> 00:39:16,151 or the set of hyperparameters, 951 00:39:16,151 --> 00:39:17,198 at the end of the day, 952 00:39:17,198 --> 00:39:19,995 that performs the best on the validation set. 953 00:39:19,995 --> 00:39:23,195 So, here we see that K=7 probably works 954 00:39:23,195 --> 00:39:25,528 best for this problem. 955 00:39:29,713 --> 00:39:32,007 So, k-nearest neighbor classifiers on images 956 00:39:32,007 --> 00:39:34,458 are actually almost never used in practice, 957 00:39:34,458 --> 00:39:38,233 because of all of these problems that we've talked about. 958 00:39:38,233 --> 00:39:41,393 So, one problem is that it's very slow at test time, 959 00:39:41,393 --> 00:39:43,115 which is the reverse of what we want, 960 00:39:43,115 --> 00:39:44,287 which we talked about earlier. 961 00:39:44,287 --> 00:39:45,582 Another problem is that 962 00:39:45,582 --> 00:39:48,859 these things like Euclidean distance, or L1 distance, 963 00:39:48,859 --> 00:39:51,648 are really not a very good way 964 00:39:51,648 --> 00:39:54,495 to measure distances between images. 965 00:39:54,495 --> 00:39:57,247 These, sort of, vectorial distance functions 966 00:39:57,247 --> 00:39:59,857 do not correspond very well to perceptual similarity 967 00:39:59,857 --> 00:40:01,254 between images, 968 00:40:01,254 --> 00:40:04,076 to how you perceive differences between images. 
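Concretely, the Euclidean (L2) distance being criticized here is just the square root of the summed squared per-pixel differences; a quick sketch with made-up arrays:

```python
import numpy as np

def l2_distance(img_a, img_b):
    # L2 (Euclidean) distance between two same-shaped images:
    # per-pixel squared differences, summed over all pixels, square-rooted.
    diff = img_a.astype(float) - img_b.astype(float)
    return np.sqrt((diff ** 2).sum())

# Tinting every pixel by the same constant moves the image by the same
# L2 amount regardless of what the image actually depicts.
original = np.zeros((32, 32, 3))
tinted = original + 10.0
```

The distance depends only on the per-pixel residuals, not on where in the image a change happens or what it means perceptually, which is exactly why very different distortions can tie.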
969 00:40:04,076 --> 00:40:06,910 So, consider this example we've constructed: 970 00:40:06,910 --> 00:40:08,796 there's this image on the left of a girl, 971 00:40:08,796 --> 00:40:11,234 and then three different distorted images on the right 972 00:40:11,234 --> 00:40:12,930 where we've blocked out her mouth, 973 00:40:12,930 --> 00:40:16,029 we've actually shifted it down by a couple pixels, 974 00:40:16,029 --> 00:40:18,416 or tinted the entire image blue. 975 00:40:18,416 --> 00:40:20,849 And, actually, if you compute the Euclidean distance 976 00:40:20,849 --> 00:40:22,785 between the original and the boxed, 977 00:40:22,785 --> 00:40:24,145 the original and the shifted, 978 00:40:24,145 --> 00:40:25,425 and the original and the tinted, 979 00:40:25,425 --> 00:40:27,735 they all have the same L2 distance. 980 00:40:27,735 --> 00:40:29,469 Which is, maybe, not so good 981 00:40:29,469 --> 00:40:32,585 because it sort of gives you the sense that 982 00:40:32,585 --> 00:40:34,952 the L2 distance is really not doing a very good job 983 00:40:34,952 --> 00:40:39,119 at capturing these perceptual distances between images. 984 00:40:40,642 --> 00:40:42,818 Another, sort of, problem with the k-nearest neighbor 985 00:40:42,818 --> 00:40:45,338 classifier has to do with something we call the curse 986 00:40:45,338 --> 00:40:46,819 of dimensionality. 987 00:40:46,819 --> 00:40:49,744 So, if you recall back this viewpoint we had of the 988 00:40:49,744 --> 00:40:51,279 k-nearest neighbor classifier, 989 00:40:51,279 --> 00:40:53,811 it's sort of dropping paint around each of the training 990 00:40:53,811 --> 00:40:57,186 data points and using that to sort of partition the space. 991 00:40:57,186 --> 00:40:59,517 So that means that if we expect the k-nearest neighbor 992 00:40:59,517 --> 00:41:01,105 classifier to work well, 993 00:41:01,105 --> 00:41:04,160 we kind of need our training examples to cover the space 994 00:41:04,160 --> 00:41:05,646 quite densely. 
995 00:41:05,646 --> 00:41:10,525 Otherwise our nearest neighbors could actually be quite far 996 00:41:10,525 --> 00:41:14,171 away and might not actually be very similar to our testing 997 00:41:14,171 --> 00:41:15,004 points. 998 00:41:16,738 --> 00:41:17,571 And the problem is, 999 00:41:17,571 --> 00:41:19,197 that actually densely covering the space, 1000 00:41:19,197 --> 00:41:21,467 means that we need a number of training examples, 1001 00:41:21,467 --> 00:41:24,666 which is exponential in the dimension of the problem. 1002 00:41:24,666 --> 00:41:29,217 So this is very bad, exponential growth is always bad, 1003 00:41:29,217 --> 00:41:31,310 basically, you're never going to get enough images 1004 00:41:31,310 --> 00:41:33,422 to densely cover this space of pixels 1005 00:41:33,422 --> 00:41:35,587 in this high dimensional space. 1006 00:41:35,587 --> 00:41:37,487 So that's maybe another thing to keep in mind 1007 00:41:37,487 --> 00:41:41,300 when you're thinking about using k-nearest neighbor. 1008 00:41:41,300 --> 00:41:42,749 So, kind of the summary is that we're using 1009 00:41:42,749 --> 00:41:44,460 k-nearest neighbor to introduce this idea 1010 00:41:44,460 --> 00:41:45,701 of image classification. 1011 00:41:45,701 --> 00:41:47,634 We have a training set of images and labels 1012 00:41:47,634 --> 00:41:48,926 and then we use that 1013 00:41:48,926 --> 00:41:51,445 to predict these labels on the test set. 1014 00:41:51,445 --> 00:41:52,278 Question? 1015 00:41:52,278 --> 00:41:54,519 [student asking a question] 1016 00:41:54,519 --> 00:41:55,352 Oh, sorry, the question is, 1017 00:41:55,352 --> 00:41:57,271 what was going on with this picture? 1018 00:41:57,271 --> 00:41:59,168 What are the green and the blue dots? 
1019 00:41:59,168 --> 00:42:02,335 So here, we have some training samples 1020 00:42:03,463 --> 00:42:05,596 which are represented by points, 1021 00:42:05,596 --> 00:42:08,128 and the color of the dot maybe represents the category 1022 00:42:08,128 --> 00:42:11,004 of the point, of this training sample. 1023 00:42:11,004 --> 00:42:12,629 So, if we're in one dimension, 1024 00:42:12,629 --> 00:42:15,161 then you maybe only need four training samples 1025 00:42:15,161 --> 00:42:17,525 to densely cover the space, 1026 00:42:17,525 --> 00:42:19,478 but if we move to two dimensions, 1027 00:42:19,478 --> 00:42:22,701 then, we now need, four times four is 16 training examples 1028 00:42:22,701 --> 00:42:25,529 to densely cover this space. 1029 00:42:25,529 --> 00:42:28,767 And if we move to three, four, five, many more dimensions, 1030 00:42:28,767 --> 00:42:30,431 the number of training examples that we need 1031 00:42:30,431 --> 00:42:32,344 to densely cover the space, 1032 00:42:32,344 --> 00:42:35,200 grows exponentially with the dimension. 1033 00:42:35,200 --> 00:42:36,536 So, this is kind of giving you the sense, 1034 00:42:36,536 --> 00:42:37,601 that maybe in two dimensions 1035 00:42:37,601 --> 00:42:40,501 we might have this kind of funny curved shape, 1036 00:42:40,501 --> 00:42:44,848 or you might have sort of arbitrary manifolds of labels 1037 00:42:44,848 --> 00:42:47,641 in different dimensional spaces. 1038 00:42:47,641 --> 00:42:49,403 Because the k-nearest neighbor algorithm 1039 00:42:49,403 --> 00:42:51,704 doesn't really make any assumptions about these 1040 00:42:51,704 --> 00:42:53,029 underlying manifolds, 1041 00:42:53,029 --> 00:42:54,565 the only way it can perform properly 1042 00:42:54,565 --> 00:42:57,713 is if it has quite a dense sample of training points 1043 00:42:57,713 --> 00:42:58,796 to work with. 
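The counting argument in this answer is easy to make concrete; the four points per dimension is just the number from the slide:

```python
def samples_needed(points_per_dim, num_dims):
    # Covering each axis with points_per_dim samples at the same density
    # requires points_per_dim ** num_dims samples in num_dims dimensions,
    # i.e. exponentially many in the dimension.
    return points_per_dim ** num_dims
```

With four points per dimension this gives 4, 16, 64, and so on; for a 3,072-dimensional CIFAR-10 image the required count is a number with over 1,800 digits, which is why densely covering pixel space is hopeless.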
1044 00:43:01,741 --> 00:43:04,915 So, this is kind of the overview of k-nearest neighbors 1045 00:43:04,915 --> 00:43:07,015 and you'll get a chance to actually implement this 1046 00:43:07,015 --> 00:43:11,214 and try it out on images in the first assignment. 1047 00:43:11,214 --> 00:43:12,953 So, unless there are any last minute questions about k-NN, 1048 00:43:12,953 --> 00:43:16,059 I'm going to move on to the next topic. 1049 00:43:16,059 --> 00:43:16,892 Question? 1050 00:43:16,892 --> 00:43:21,014 [student is asking a question] 1051 00:43:21,014 --> 00:43:22,034 Sorry, say that again. 1052 00:43:22,034 --> 00:43:25,951 [student is asking a question] 1053 00:43:28,437 --> 00:43:29,270 Yeah, so the question is, 1054 00:43:29,270 --> 00:43:32,033 why do these images have the same L2 distance? 1055 00:43:32,033 --> 00:43:34,036 And the answer is that, I carefully constructed them 1056 00:43:34,036 --> 00:43:35,915 to have the same L2 distance. 1057 00:43:35,915 --> 00:43:38,716 [laughing] 1058 00:43:38,716 --> 00:43:41,812 But it's just giving you the sense that the L2 distance 1059 00:43:41,812 --> 00:43:45,096 is not a very good measure of similarity between images. 1060 00:43:45,096 --> 00:43:47,306 And these images are actually all different from 1061 00:43:47,306 --> 00:43:50,223 each other in quite disparate ways. 1062 00:43:52,470 --> 00:43:54,498 If you're using k-NN, 1063 00:43:54,498 --> 00:43:56,654 then the only thing you have to measure distance 1064 00:43:56,654 --> 00:43:57,649 between images, 1065 00:43:57,649 --> 00:44:00,236 is this single distance metric. 1066 00:44:00,236 --> 00:44:03,200 And this kind of gives you an example where 1067 00:44:03,200 --> 00:44:05,229 that distance metric is actually not capturing 1068 00:44:05,229 --> 00:44:07,331 the full description of distance or difference 1069 00:44:07,331 --> 00:44:08,949 between images. 
1070 00:44:08,949 --> 00:44:12,042 So, in this case, I just sort of carefully constructed these 1071 00:44:12,042 --> 00:44:15,384 translations and these offsets to match exactly. 1072 00:44:15,384 --> 00:44:16,217 Question? 1073 00:44:16,217 --> 00:44:19,884 [student asking a question] 1074 00:44:28,672 --> 00:44:29,925 So, the question is, 1075 00:44:29,925 --> 00:44:31,249 maybe this is actually good, 1076 00:44:31,249 --> 00:44:33,228 because all of these things 1077 00:44:33,228 --> 00:44:36,615 are actually having the same distance to the image. 1078 00:44:36,615 --> 00:44:38,024 That's maybe true for this example, 1079 00:44:38,024 --> 00:44:40,390 but I think you could also construct examples where 1080 00:44:40,390 --> 00:44:41,810 maybe we have two original images 1081 00:44:41,810 --> 00:44:43,352 and then by putting the boxes in the right places 1082 00:44:43,352 --> 00:44:44,270 or tinting them, 1083 00:44:44,270 --> 00:44:46,652 we could cause it to be nearer to pretty much 1084 00:44:46,652 --> 00:44:48,316 anything that you want, right? 1085 00:44:48,316 --> 00:44:50,850 Because in this example, we can kind of like do arbitrary 1086 00:44:50,850 --> 00:44:52,007 shifting and tinting 1087 00:44:52,007 --> 00:44:54,612 to kind of change these distances nearly arbitrarily 1088 00:44:54,612 --> 00:44:57,469 without changing the perceptual nature of these images. 1089 00:44:57,469 --> 00:44:59,256 So, I think that this can actually screw you up 1090 00:44:59,256 --> 00:45:03,172 if you have many different original images. 1091 00:45:03,172 --> 00:45:04,039 Question? 1092 00:45:04,039 --> 00:45:07,956 [student is asking a question] 1093 00:45:15,207 --> 00:45:16,040 The question is, 1094 00:45:16,040 --> 00:45:17,339 whether or not it's common in real-world cases 1095 00:45:17,339 --> 00:45:20,664 to go back and retrain on the entire dataset 1096 00:45:20,664 --> 00:45:24,098 once you've found those best hyperparameters? 
1097 00:45:24,098 --> 00:45:27,765 So, people do sometimes do this in practice, 1098 00:45:28,787 --> 00:45:30,982 but it's somewhat of a matter of taste. 1099 00:45:30,982 --> 00:45:32,568 If you're really rushing for that deadline 1100 00:45:32,568 --> 00:45:34,430 and you've really got to get this model out the door 1101 00:45:34,430 --> 00:45:36,781 then, if it takes a long time to retrain the model 1102 00:45:36,781 --> 00:45:38,176 on the whole dataset, 1103 00:45:38,176 --> 00:45:39,875 then maybe you won't do it. 1104 00:45:39,875 --> 00:45:41,776 But if you have a little bit more time to spare 1105 00:45:41,776 --> 00:45:43,432 and a little bit more compute to spare, 1106 00:45:43,432 --> 00:45:45,929 and you want to squeeze out maybe that extra 1% 1107 00:45:45,929 --> 00:45:50,012 of performance, then that is a trick you can use. 1108 00:45:53,288 --> 00:45:54,858 So we kind of saw that the k-nearest neighbor 1109 00:45:54,858 --> 00:45:56,758 classifier has a lot of the nice properties 1110 00:45:56,758 --> 00:45:58,409 of machine learning algorithms, 1111 00:45:58,409 --> 00:46:00,490 but in practice it's not so great, 1112 00:46:00,490 --> 00:46:03,823 and really not used very much on images. 1113 00:46:05,258 --> 00:46:07,134 So the next thing I'd like to talk about is 1114 00:46:07,134 --> 00:46:08,895 linear classification. 1115 00:46:08,895 --> 00:46:11,753 And linear classification is, again, quite a simple learning 1116 00:46:11,753 --> 00:46:14,981 algorithm, but this will become super important 1117 00:46:14,981 --> 00:46:17,257 and help us build up to whole neural networks 1118 00:46:17,257 --> 00:46:19,845 and whole convolutional networks. 1119 00:46:19,845 --> 00:46:21,736 So, one analogy people often talk about 1120 00:46:21,736 --> 00:46:23,470 when working with neural networks 1121 00:46:23,470 --> 00:46:26,242 is we think of them as being kind of like Lego blocks. 
1122 00:46:26,242 --> 00:46:28,154 That you can have different kinds of components 1123 00:46:28,154 --> 00:46:30,345 of neural networks and you can stick these components 1124 00:46:30,345 --> 00:46:33,814 together to build these large different towers of 1125 00:46:33,814 --> 00:46:36,378 convolutional networks. 1126 00:46:36,378 --> 00:46:38,404 One of the most basic building blocks that we'll see 1127 00:46:38,404 --> 00:46:41,513 in different types of deep learning applications 1128 00:46:41,513 --> 00:46:43,215 is this linear classifier. 1129 00:46:43,215 --> 00:46:44,996 So, I think it's actually really important to 1130 00:46:44,996 --> 00:46:46,905 have a good understanding of what's happening 1131 00:46:46,905 --> 00:46:48,270 with linear classification. 1132 00:46:48,270 --> 00:46:50,236 Because these will end up generalizing quite nicely 1133 00:46:50,236 --> 00:46:52,712 to whole neural networks. 1134 00:46:52,712 --> 00:46:55,243 So another example of kind of this modular nature 1135 00:46:55,243 --> 00:46:56,139 of neural networks 1136 00:46:56,139 --> 00:46:58,728 comes from some research in our own lab on image captioning, 1137 00:46:58,728 --> 00:47:00,281 just as a little bit of a preview. 1138 00:47:00,281 --> 00:47:02,847 So here the setup is that we want to input an image 1139 00:47:02,847 --> 00:47:04,789 and then output a descriptive sentence 1140 00:47:04,789 --> 00:47:06,226 describing the image. 1141 00:47:06,226 --> 00:47:08,162 And the way this kind of works is that 1142 00:47:08,162 --> 00:47:10,710 we have one convolutional neural network that's looking 1143 00:47:10,710 --> 00:47:11,548 at the image, 1144 00:47:11,548 --> 00:47:13,444 and a recurrent neural network that knows 1145 00:47:13,444 --> 00:47:14,496 about language. 
1146 00:47:14,496 --> 00:47:16,664 And we can kind of just stick these two pieces together 1147 00:47:16,664 --> 00:47:18,776 like Lego blocks and train the whole thing together 1148 00:47:18,776 --> 00:47:20,919 and end up with a pretty cool system 1149 00:47:20,919 --> 00:47:22,767 that can do some non-trivial things. 1150 00:47:22,767 --> 00:47:25,077 And we'll work through the details of this model as we go 1151 00:47:25,077 --> 00:47:26,388 forward in the class, 1152 00:47:26,388 --> 00:47:27,711 but this just gives you the sense that, 1153 00:47:27,711 --> 00:47:30,209 these deep neural networks are kind of like Legos 1154 00:47:30,209 --> 00:47:31,999 and this linear classifier 1155 00:47:31,999 --> 00:47:34,082 is kind of like the most basic building blocks 1156 00:47:34,082 --> 00:47:36,082 of these giant networks. 1157 00:47:37,096 --> 00:47:39,189 But that's a little bit too exciting for lecture two, 1158 00:47:39,189 --> 00:47:41,257 so we have to go back to CIFAR-10 for the moment. 1159 00:47:41,257 --> 00:47:42,375 [laughing] 1160 00:47:42,375 --> 00:47:45,641 So, recall that CIFAR-10 has these 50,000 training examples, 1161 00:47:45,641 --> 00:47:49,808 each image is 32 by 32 pixels and three color channels. 1162 00:47:52,068 --> 00:47:53,963 In linear classification, we're going to take a bit 1163 00:47:53,963 --> 00:47:56,696 of a different approach from k-nearest neighbor. 1164 00:47:56,696 --> 00:48:01,646 So, the linear classifier is one of the simplest examples 1165 00:48:01,646 --> 00:48:04,734 of what we call a parametric model. 1166 00:48:04,734 --> 00:48:07,548 So now, our parametric model actually has two different 1167 00:48:07,548 --> 00:48:08,685 components. 
1168 00:48:08,685 --> 00:48:12,201 It's going to take in this image, maybe, of a cat on the left, 1169 00:48:12,201 --> 00:48:13,539 and this, 1170 00:48:13,539 --> 00:48:17,384 that we usually write as X for our input data, 1171 00:48:17,384 --> 00:48:20,220 and also a set of parameters, or weights, 1172 00:48:20,220 --> 00:48:23,475 which is usually called W, also sometimes theta, 1173 00:48:23,475 --> 00:48:24,767 depending on the literature. 1174 00:48:24,767 --> 00:48:26,974 And now we're going to write down some function 1175 00:48:26,974 --> 00:48:30,780 which takes in both the data, X, and the parameters, W, 1176 00:48:30,780 --> 00:48:34,180 and this'll spit out now 10 numbers describing 1177 00:48:34,180 --> 00:48:37,950 what are the scores corresponding to each of those 10 1178 00:48:37,950 --> 00:48:39,991 categories in CIFAR-10. 1179 00:48:39,991 --> 00:48:44,232 With the interpretation that, like the larger score for cat, 1180 00:48:44,232 --> 00:48:48,583 indicates a larger probability of that input X being cat. 1181 00:48:48,583 --> 00:48:49,827 And now, a question? 1182 00:48:49,827 --> 00:48:53,494 [student asking a question] 1183 00:48:55,380 --> 00:48:56,717 Sorry, can you repeat that? 1184 00:48:56,717 --> 00:48:58,863 [student asking a question] 1185 00:48:58,863 --> 00:49:01,524 Oh, so the question is what is the three? 1186 00:49:01,524 --> 00:49:04,986 The three, in this example, corresponds to the three color 1187 00:49:04,986 --> 00:49:06,943 channels, red, green, and blue. 1188 00:49:06,943 --> 00:49:08,469 Because we typically work on color images, 1189 00:49:08,469 --> 00:49:12,636 that's nice information that you don't want to throw away. 
1190 00:49:15,323 --> 00:49:17,243 So, in the k-nearest neighbor setup 1191 00:49:17,243 --> 00:49:18,999 there were no parameters, instead, 1192 00:49:18,999 --> 00:49:21,686 we just kind of keep around the whole training data, 1193 00:49:21,686 --> 00:49:22,657 the whole training set, 1194 00:49:22,657 --> 00:49:24,092 and use that at test time. 1195 00:49:24,092 --> 00:49:25,825 But now, in a parametric approach, 1196 00:49:25,825 --> 00:49:28,196 we're going to summarize our knowledge of the training data 1197 00:49:28,196 --> 00:49:31,105 and stick all that knowledge into these parameters, W. 1198 00:49:31,105 --> 00:49:33,607 And now, at test time, we no longer need the actual 1199 00:49:33,607 --> 00:49:35,371 training data, we can throw it away. 1200 00:49:35,371 --> 00:49:37,938 We only need these parameters, W, at test time. 1201 00:49:37,938 --> 00:49:40,561 So this allows our models to now be more efficient 1202 00:49:40,561 --> 00:49:44,684 and actually run on maybe small devices like phones. 1203 00:49:44,684 --> 00:49:47,277 So, kind of, the whole story in deep learning 1204 00:49:47,277 --> 00:49:49,404 is coming up with the right structure for this 1205 00:49:49,404 --> 00:49:50,906 function, F. 1206 00:49:50,906 --> 00:49:53,451 You can imagine writing down different functional forms 1207 00:49:53,451 --> 00:49:55,406 for how to combine weights and data in different 1208 00:49:55,406 --> 00:49:58,608 complex ways, and these could correspond to different 1209 00:49:58,608 --> 00:50:01,169 network architectures. 1210 00:50:01,169 --> 00:50:02,489 But the simplest possible example 1211 00:50:02,489 --> 00:50:04,132 of combining these two things 1212 00:50:04,132 --> 00:50:05,729 is just, maybe, to multiply them. 1213 00:50:05,729 --> 00:50:08,833 And this is a linear classifier. 1214 00:50:08,833 --> 00:50:13,181 So here our F of X, W is just equal to W times X. 1215 00:50:13,181 --> 00:50:15,770 Probably the simplest equation you can imagine. 
1216 00:50:15,770 --> 00:50:16,603 So here, 1217 00:50:16,603 --> 00:50:18,921 if you kind of unpack the dimensions of these things, 1218 00:50:18,921 --> 00:50:23,088 we recall that our image was maybe 32 by 32 by 3 values. 1219 00:50:24,871 --> 00:50:28,207 So then, we're going to take those values and then stretch 1220 00:50:28,207 --> 00:50:31,424 them out into a long column vector 1221 00:50:31,424 --> 00:50:34,715 that is 3,072 by one. 1222 00:50:34,715 --> 00:50:38,632 And now we want to end up with 10 class scores. 1223 00:50:39,746 --> 00:50:41,742 We want to end up with 10 numbers for this image 1224 00:50:41,742 --> 00:50:44,236 giving us the scores for each of the 10 categories. 1225 00:50:44,236 --> 00:50:46,279 Which means that now our matrix, W, 1226 00:50:46,279 --> 00:50:49,032 needs to be 10 by 3072. 1227 00:50:49,032 --> 00:50:51,299 So that once we multiply these two things out 1228 00:50:51,299 --> 00:50:53,243 then we'll end up with a single column vector 1229 00:50:53,243 --> 00:50:57,086 10 by one, giving us our 10 class scores. 1230 00:50:57,086 --> 00:51:00,317 Also sometimes, you'll typically see this, 1231 00:51:00,317 --> 00:51:01,910 we'll often add a bias term 1232 00:51:01,910 --> 00:51:04,724 which will be a constant vector of 10 elements 1233 00:51:04,724 --> 00:51:06,669 that does not interact with the training data, 1234 00:51:06,669 --> 00:51:09,656 and instead just gives us some sort of data independent 1235 00:51:09,656 --> 00:51:12,235 preferences for some classes over others. 1236 00:51:12,235 --> 00:51:13,878 So you might imagine that if your dataset was 1237 00:51:13,878 --> 00:51:16,300 unbalanced and had many more cats than dogs, 1238 00:51:16,300 --> 00:51:19,259 for example, then the bias elements corresponding 1239 00:51:19,259 --> 00:51:23,553 to cat would be higher than the other ones. 
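Unpacking those shapes in code, a sketch of the score function with random numbers standing in for learned weights (nothing here is trained, it only shows the dimensions lining up):

```python
import numpy as np

def linear_classifier(x, W, b):
    # f(x, W) = Wx + b: a (10, 3072) weight matrix times a 3072-dim
    # image vector, plus a 10-dim bias, gives 10 class scores.
    return W.dot(x) + b

x = np.random.randn(32 * 32 * 3)             # a 32x32x3 image, stretched out
W = 0.01 * np.random.randn(10, 32 * 32 * 3)  # stand-in for learned weights
b = np.zeros(10)                             # data-independent per-class offsets
scores = linear_classifier(x, W, b)
predicted_class = scores.argmax()            # highest score wins
```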
1240 00:51:23,553 --> 00:51:25,445 So if you kind of think about pictorially 1241 00:51:25,445 --> 00:51:27,778 what this function is doing, 1242 00:51:28,930 --> 00:51:31,483 in this figure we have an example on the left 1243 00:51:31,483 --> 00:51:35,729 of a simple case with just a two by two image, 1244 00:51:35,729 --> 00:51:37,099 so it has four pixels total. 1245 00:51:37,099 --> 00:51:39,832 So the way that the linear classifier works 1246 00:51:39,832 --> 00:51:42,273 is that we take this two by two image, 1247 00:51:42,273 --> 00:51:45,202 we stretch it out into a column vector 1248 00:51:45,202 --> 00:51:46,577 with four elements, 1249 00:51:46,577 --> 00:51:49,301 and now, in this example, we are just restricting to 1250 00:51:49,301 --> 00:51:51,765 three classes, cat, dog, and ship, 1251 00:51:51,765 --> 00:51:54,030 because you can't fit 10 on a slide, 1252 00:51:54,030 --> 00:51:58,197 and now our weight matrix is going to be three by four, 1253 00:51:59,890 --> 00:52:02,236 so we have four pixels and three classes. 1254 00:52:02,236 --> 00:52:05,567 And now, again, we have a three element bias vector 1255 00:52:05,567 --> 00:52:08,858 that gives us data independent bias terms 1256 00:52:08,858 --> 00:52:11,125 for each category. 1257 00:52:11,125 --> 00:52:13,467 Now we see that the cat score is going to be the inner 1258 00:52:13,467 --> 00:52:16,717 product between the pixels of our image 1259 00:52:18,436 --> 00:52:20,650 and this row in the weight matrix 1260 00:52:20,650 --> 00:52:22,814 added together with this bias term. 1261 00:52:22,814 --> 00:52:25,134 So, when you look at it this way 1262 00:52:25,134 --> 00:52:28,146 you can kind of understand linear classification 1263 00:52:28,146 --> 00:52:30,157 as almost a template matching approach. 1264 00:52:30,157 --> 00:52:32,444 Where each of the rows in this matrix 1265 00:52:32,444 --> 00:52:35,652 correspond to some template of the image. 
1266 00:52:35,652 --> 00:52:38,113 And now the inner product or dot product 1267 00:52:38,113 --> 00:52:41,099 between the row of the matrix and the column 1268 00:52:41,099 --> 00:52:43,183 giving the pixels of the image, 1269 00:52:43,183 --> 00:52:45,165 computing this dot product kind of gives us 1270 00:52:45,165 --> 00:52:47,874 a similarity between this template for the class 1271 00:52:47,874 --> 00:52:50,458 and the pixels of our image. 1272 00:52:50,458 --> 00:52:52,906 And then bias just, again, gives you this data 1273 00:52:52,906 --> 00:52:57,073 independent scaling offset to each of the classes. 1274 00:53:00,837 --> 00:53:02,401 If we think about linear classification 1275 00:53:02,401 --> 00:53:04,705 from this viewpoint of template matching 1276 00:53:04,705 --> 00:53:07,532 we can actually take the rows of that weight matrix 1277 00:53:07,532 --> 00:53:09,965 and unravel them back into images 1278 00:53:09,965 --> 00:53:12,799 and actually visualize those templates as images. 1279 00:53:12,799 --> 00:53:14,734 And this gives us some sense of what a linear 1280 00:53:14,734 --> 00:53:16,769 classifier might actually be doing 1281 00:53:16,769 --> 00:53:18,908 to try to understand our data. 1282 00:53:18,908 --> 00:53:21,057 So, in this example, we've gone ahead and trained 1283 00:53:21,057 --> 00:53:23,208 a linear classifier on our images. 1284 00:53:23,208 --> 00:53:25,355 And now on the bottom we're visualizing 1285 00:53:25,355 --> 00:53:28,974 what are those rows in that learned weight matrix 1286 00:53:28,974 --> 00:53:30,892 corresponding to each of the 10 categories 1287 00:53:30,892 --> 00:53:32,274 in CIFAR-10. 1288 00:53:32,274 --> 00:53:34,117 And in this way we kind of get a sense for what's 1289 00:53:34,117 --> 00:53:35,686 going on in these images. 
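The unraveling described here is just a reshape; a sketch with random weights standing in for trained ones, plus the usual rescaling into displayable pixel values (the helper name is made up):

```python
import numpy as np

W = np.random.randn(10, 32 * 32 * 3)  # stand-in for a learned weight matrix
templates = W.reshape(10, 32, 32, 3)  # one 32x32x3 "template image" per class

def to_displayable(template):
    # Weights can be arbitrary real numbers, so rescale them into
    # [0, 255] before viewing a template as an image.
    t = template - template.min()
    t = (t / t.max()) * 255.0
    return t.astype(np.uint8)
```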
1290 00:53:35,686 --> 00:53:38,841 So, for example, in the left, on the bottom left, 1291 00:53:38,841 --> 00:53:40,978 we see the template for the plane class, 1292 00:53:40,978 --> 00:53:42,941 kind of consists of this like blue blob, 1293 00:53:42,941 --> 00:53:44,856 this kind of blobby thing in the middle 1294 00:53:44,856 --> 00:53:46,739 and maybe blue in the background, 1295 00:53:46,739 --> 00:53:49,212 which gives you the sense that this linear classifier 1296 00:53:49,212 --> 00:53:51,574 for plane is maybe looking for blue stuff 1297 00:53:51,574 --> 00:53:54,585 and blobby stuff, and those features are going to cause 1298 00:53:54,585 --> 00:53:57,606 the classifier to like planes more. 1299 00:53:57,606 --> 00:53:59,444 Or if we look at this car example, 1300 00:53:59,444 --> 00:54:02,441 we kind of see that there's a red blobby thing 1301 00:54:02,441 --> 00:54:04,801 through the middle and a blue blobby thing at the top 1302 00:54:04,801 --> 00:54:08,721 that maybe is kind of a blurry windshield. 1303 00:54:08,721 --> 00:54:09,654 But this is a little bit weird, 1304 00:54:09,654 --> 00:54:11,271 this doesn't really look like a car. 1305 00:54:11,271 --> 00:54:13,716 No individual car actually looks like this. 1306 00:54:13,716 --> 00:54:15,628 So the problem is that the linear classifier 1307 00:54:15,628 --> 00:54:18,317 is only learning one template for each class. 1308 00:54:18,317 --> 00:54:20,823 So if there's sort of variations in how that class 1309 00:54:20,823 --> 00:54:21,748 might appear, 1310 00:54:21,748 --> 00:54:24,340 it's trying to average out all those different variations, 1311 00:54:24,340 --> 00:54:25,658 all those different appearances, 1312 00:54:25,658 --> 00:54:27,461 and use just one single template 1313 00:54:27,461 --> 00:54:29,675 to recognize each of those categories. 1314 00:54:29,675 --> 00:54:31,961 We can also see this pretty explicitly in the horse 1315 00:54:31,961 --> 00:54:33,139 classifier. 
1316 00:54:33,139 --> 00:54:35,776 So in the horse classifier we see green stuff on the bottom 1317 00:54:35,776 --> 00:54:37,340 because horses are usually on grass. 1318 00:54:37,340 --> 00:54:39,289 And then, if you look carefully, the horse actually 1319 00:54:39,289 --> 00:54:43,125 seems to have maybe two heads, one head on each side. 1320 00:54:43,125 --> 00:54:45,855 And I've never seen a horse with two heads. 1321 00:54:45,855 --> 00:54:48,063 But the linear classifier is just doing the best 1322 00:54:48,063 --> 00:54:50,513 that it can, because it's only allowed to learn 1323 00:54:50,513 --> 00:54:52,788 one template per category. 1324 00:54:52,788 --> 00:54:55,085 And as we move forward into neural networks 1325 00:54:55,085 --> 00:54:56,367 and more complex models, 1326 00:54:56,367 --> 00:54:59,460 we'll be able to achieve much better accuracy 1327 00:54:59,460 --> 00:55:01,230 because they no longer have this restriction 1328 00:55:01,230 --> 00:55:05,230 of just learning a single template per category. 1329 00:55:09,030 --> 00:55:11,223 Another viewpoint of the linear classifier 1330 00:55:11,223 --> 00:55:13,523 is to go back to this idea of images 1331 00:55:13,523 --> 00:55:15,649 as points in high dimensional space. 1332 00:55:15,649 --> 00:55:19,232 And you can imagine that each of our images 1333 00:55:20,191 --> 00:55:23,328 is something like a point in this high dimensional space. 1334 00:55:23,328 --> 00:55:26,341 And now the linear classifier is putting in these 1335 00:55:26,341 --> 00:55:29,305 linear decision boundaries to try to draw linear 1336 00:55:29,305 --> 00:55:31,402 separation between one category 1337 00:55:31,402 --> 00:55:33,125 and the rest of the categories. 
1338 00:55:33,125 --> 00:55:35,838 So maybe up on the upper-left hand side 1339 00:55:35,838 --> 00:55:38,897 we see these training examples of airplanes 1340 00:55:38,897 --> 00:55:41,446 and throughout the process of training 1341 00:55:41,446 --> 00:55:43,845 the linear classifier will go and try to draw this 1342 00:55:43,845 --> 00:55:46,692 blue line to separate out with a single line 1343 00:55:46,692 --> 00:55:49,493 the airplane class from all the rest of the classes. 1344 00:55:49,493 --> 00:55:51,492 And it's actually kind of fun if you watch during 1345 00:55:51,492 --> 00:55:53,902 the training process these lines will start out randomly 1346 00:55:53,902 --> 00:55:55,818 and then go and snap into place to try to separate 1347 00:55:55,818 --> 00:55:57,318 the data properly. 1348 00:55:58,709 --> 00:56:00,921 But when you think about linear classification 1349 00:56:00,921 --> 00:56:04,000 in this way, from this high dimensional point of view, 1350 00:56:04,000 --> 00:56:06,768 you can start to see again what are some of the problems 1351 00:56:06,768 --> 00:56:09,758 that might come up with linear classification. 1352 00:56:09,758 --> 00:56:11,672 And it's not too hard to construct examples 1353 00:56:11,672 --> 00:56:15,232 of datasets where a linear classifier will totally fail. 1354 00:56:15,232 --> 00:56:16,998 So, one example, on the left here, 1355 00:56:16,998 --> 00:56:20,324 is that, suppose we have a dataset of two categories, 1356 00:56:20,324 --> 00:56:22,067 and these are all maybe somewhat artificial, 1357 00:56:22,067 --> 00:56:24,990 but maybe our dataset has two categories, 1358 00:56:24,990 --> 00:56:26,095 blue and red. 1359 00:56:26,095 --> 00:56:30,414 And the blue category is anything where the number of pixels 1360 00:56:30,414 --> 00:56:33,122 in the image that are greater than zero is odd.
1361 00:56:33,122 --> 00:56:34,944 And anything where the number of pixels greater 1362 00:56:34,944 --> 00:56:38,714 than zero is even, we want to classify as the red category. 1363 00:56:38,714 --> 00:56:42,881 So if you actually go and draw what these different 1364 00:56:43,991 --> 00:56:46,259 decision regions look like in the plane, 1365 00:56:46,259 --> 00:56:49,907 you can see that our blue class with an odd number of pixels 1366 00:56:49,907 --> 00:56:53,426 is going to be these two quadrants in the plane, 1367 00:56:53,426 --> 00:56:56,063 and even will be the opposite two quadrants. 1368 00:56:56,063 --> 00:56:59,609 So now, there's no way that we can draw a single straight line 1369 00:56:59,609 --> 00:57:01,364 to separate the blue from the red. 1370 00:57:01,364 --> 00:57:03,412 So this would be an example where a linear classifier 1371 00:57:03,412 --> 00:57:05,273 would really struggle. 1372 00:57:05,273 --> 00:57:09,535 And this is maybe not such an artificial thing after all. 1373 00:57:09,535 --> 00:57:10,912 Instead of counting pixels, 1374 00:57:10,912 --> 00:57:13,042 maybe we're actually trying to count whether the number 1375 00:57:13,042 --> 00:57:16,424 of animals or people in an image is odd or even. 1376 00:57:16,424 --> 00:57:19,004 So this kind of parity problem 1377 00:57:19,004 --> 00:57:21,145 of separating odds from evens 1378 00:57:21,145 --> 00:57:22,725 is something that linear classification 1379 00:57:22,725 --> 00:57:25,725 traditionally really struggles with. 1380 00:57:28,376 --> 00:57:31,438 Other situations where a linear classifier really struggles 1381 00:57:31,438 --> 00:57:33,974 are multimodal situations. 1382 00:57:33,974 --> 00:57:35,146 So here on the right, 1383 00:57:35,146 --> 00:57:39,541 maybe our blue category has these three different islands 1384 00:57:39,541 --> 00:57:41,915 of where the blue category lives, 1385 00:57:41,915 --> 00:57:44,791 and then everything else is some other category.
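[The smallest version of this parity problem is XOR on four points. Here is a toy sketch, my own example rather than one from the lecture: plain logistic regression, which is a linear classifier, trained by gradient descent on XOR data. Starting from the symmetric all-zeros initialization, the gradient is exactly zero, so the model never moves, and no single line would separate the classes anyway.]

```python
import numpy as np

# XOR: the label is 1 exactly when an odd number of the two inputs is "on".
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Plain logistic regression trained by gradient descent.
# From the all-zeros start the gradient cancels by symmetry,
# so the weights never move off zero.
w, b = np.zeros(2), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))  # predicted probabilities
    grad_w = X.T.dot(p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

preds = (1.0 / (1.0 + np.exp(-(X.dot(w) + b))) >= 0.5).astype(float)
accuracy = np.mean(preds == y)  # stuck at chance: a line can't carve out XOR
```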
1386 00:57:44,791 --> 00:57:46,529 So something like the horses 1387 00:57:46,529 --> 00:57:47,961 we saw in the previous example 1388 00:57:47,961 --> 00:57:50,106 is a case where this actually might be happening 1389 00:57:50,106 --> 00:57:50,959 in practice. 1390 00:57:50,959 --> 00:57:53,560 Where there's maybe one island in the pixel space of 1391 00:57:53,560 --> 00:57:55,159 horses looking to the left, 1392 00:57:55,159 --> 00:57:57,268 and another island of horses looking to the right. 1393 00:57:57,268 --> 00:58:00,360 And now there's no good way to draw a single linear 1394 00:58:00,360 --> 00:58:03,757 boundary between these two isolated islands of data. 1395 00:58:03,757 --> 00:58:07,257 So any time you have multimodal data, 1396 00:58:08,098 --> 00:58:08,931 like one class 1397 00:58:08,931 --> 00:58:10,854 that can appear in different regions of space, 1398 00:58:10,854 --> 00:58:13,963 is another place where linear classifiers might struggle. 1399 00:58:13,963 --> 00:58:15,932 So there's kind of a lot of problems with 1400 00:58:15,932 --> 00:58:18,575 linear classifiers, but it is a super simple algorithm, 1401 00:58:18,575 --> 00:58:22,259 super nice and easy to interpret and easy to understand. 1402 00:58:22,259 --> 00:58:24,412 So you'll actually be implementing these things 1403 00:58:24,412 --> 00:58:27,245 on your first homework assignment. 1404 00:58:28,852 --> 00:58:29,971 At this point, 1405 00:58:29,971 --> 00:58:30,980 we kind of talked about 1406 00:58:30,980 --> 00:58:33,313 what is the functional form corresponding to a 1407 00:58:33,313 --> 00:58:34,402 linear classifier. 1408 00:58:34,402 --> 00:58:36,711 And we've seen that this functional form 1409 00:58:36,711 --> 00:58:39,185 of matrix vector multiply 1410 00:58:39,185 --> 00:58:41,541 corresponds to this idea of template matching 1411 00:58:41,541 --> 00:58:43,364 and learning a single template for each category 1412 00:58:43,364 --> 00:58:44,922 in your data.
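[That functional form, f(x, W) = Wx + b, can be written out as a short NumPy sketch. The shapes below follow the CIFAR-10-style setup (32x32x3 images, 10 classes), but the weight matrix here is just a random stand-in for a trained one, so the actual scores are meaningless; the point is the shapes and the template view.]

```python
import numpy as np

# f(x, W) = Wx + b: one score per class via a single matrix-vector multiply.
num_classes, dim = 10, 32 * 32 * 3            # CIFAR-10-style shapes
rng = np.random.default_rng(0)
W = rng.standard_normal((num_classes, dim)) * 0.001  # stand-in for trained weights
b = np.zeros(num_classes)

x = rng.standard_normal(dim)                  # one flattened 32x32x3 image
scores = W.dot(x) + b                         # shape (10,), one score per class
predicted = int(np.argmax(scores))            # the class whose template matches best

# Each row of W can be reshaped back into a 32x32x3 image: that's the
# single "template" the classifier has learned for that class.
templates = W.reshape(num_classes, 32, 32, 3)
```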
1413 00:58:44,922 --> 00:58:48,339 And then once we have this trained matrix 1414 00:58:49,485 --> 00:58:52,499 you can use it to actually go and get your scores 1415 00:58:52,499 --> 00:58:55,738 for any new example. 1416 00:58:55,738 --> 00:58:58,104 But what we have not told you is 1417 00:58:58,104 --> 00:59:00,597 how do you actually go about choosing the right W 1418 00:59:00,597 --> 00:59:01,951 for your dataset. 1419 00:59:01,951 --> 00:59:03,908 We've just talked about what is the functional form 1420 00:59:03,908 --> 00:59:06,907 and what is going on with this thing. 1421 00:59:06,907 --> 00:59:11,086 So that's something we'll really focus on next time. 1422 00:59:11,086 --> 00:59:12,933 And next lecture we'll talk about 1423 00:59:12,933 --> 00:59:14,668 what are the strategies and algorithms 1424 00:59:14,668 --> 00:59:16,581 for choosing the right W. 1425 00:59:16,581 --> 00:59:17,708 And this will lead us to questions 1426 00:59:17,708 --> 00:59:19,544 of loss functions and optimization 1427 00:59:19,544 --> 00:59:21,322 and eventually ConvNets. 1428 00:59:21,322 --> 00:59:25,044 So, that's a bit of the preview for next week. 1429 00:59:25,044 --> 00:00:00,000 And that's all we have for today.